docs(project): revised project work to have clearer tasks (#35)
lwjohnst86 authored Nov 30, 2024
1 parent 3093dea commit 93f403c
Showing 2 changed files with 53 additions and 31 deletions.
77 changes: 47 additions & 30 deletions appendix/project.qmd
@@ -2,11 +2,20 @@

To maximize how much you learn and how much you will retain, you as a
group will take what you learn in the course and apply it to create a
reproducible project. This project ...
reproducible project within a server environment. This project ...

We will have created a project folder on the server for you to work in
and assigned you and your teammate(s) to the project. Within your
project, you will also be given a specific set of outputs to create (a
figure, a table, and a basic report), based on the data we provide to
you. The outputs will not require using all of the data given.

<!-- TODO: Include a panelset of specific data and their outputs? -->

During the last session of the course you will work on this assignment.
In the last \~20 minutes of this session, the lead instructor will ...
and re-generate your report to check that it is reproducible.
In the last \~20 minutes of this session, the lead instructor will go
into your projects and re-generate your report to check that it is
reproducible.

## Specific tasks

@@ -16,41 +25,49 @@ quickly start collaborating together on the project.

Your specific tasks are:

Sequence of steps for project:

- Starting point:
- Learning how to identify which file storage format (e.g. CSV or
SAS dataset) the data is in and knowing how to convert those files
into more efficient formats (like Parquet or a SQL database)
- Give them a few server environment types, and the same data but
with different starting formats. And then they figure out the
next steps based on that information
- Multiple data is big enough to prevent doing it the normal way (1 GB
or larger?)
- Explaining why the original data format might not be ideal and then
converting the data into more efficient format
- Identify what the desired sample is for the dataset, only select and
filter data they need for analysis
- Split the data into smaller chunks to prototype code (running code on
all the data later)
- Run basic analysis (descriptive statistics)... Not modeling
- Implement some code to run with parallel processing
- Identifying the format in which data or items can be downloaded, and
converting them to that format

Assumptions:

- Assume they have taken the intermediate course (need to know
functionals and function-based workflows), and either have read or
taken the advanced course or are familiar enough with targets
1. Review the outputs we want you to create as well as the specific
data needed for creating them. Keep these in mind for later tasks.
2. Identify what resources you have available in the server
environment, such as the number of cores and amount of memory. Use
this information to guide your coding.
3. Look into the ... folder that contains the raw data you will use for
this project. Identify which file storage format the data is saved
in and convert it into a more efficient format if necessary.
Depending on which format it is, save the converted format into the
... folder. Make use of parallel processing to speed this up (using
`{furrr}`).
4. Read in a subset of the data that only contains the columns and rows
you need for your outputs. Randomly keep a slice of this data to use
for prototyping code later.
5. Working backwards, write out a set of (empty) functions that provide
the sequence of steps that will create a specific output. Begin
filling in these functions, making sure they work before adding them
to the `{targets}` pipeline. Write the functions and `{targets}`
pipeline so that they run in parallel.
6. Incorporate the outputs into a report and include that report in the
`{targets}` pipeline.
7. Comment out the line of code that randomly keeps a slice of the data
and then run the `{targets}` pipeline with `targets::tar_make()`.
8. The project will be complete if you can regenerate all the outputs
using only the `{targets}` pipeline.
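
As a rough illustration of tasks 3 and 4, the sketch below converts raw
CSV files to Parquet in parallel and then reads back only the columns
that are needed. It is not part of the course material: the function
names (`parquet_path()`, `convert_to_parquet()`, `convert_all()`,
`read_subset()`), the choice of CSV as the starting format, and Parquet
via `{arrow}` as the target format are all assumptions made for the
example.

```r
# Build the output path for one raw file,
# e.g. "data-raw/a.csv" becomes "data/a.parquet".
parquet_path <- function(csv_path, out_dir) {
  file_name <- tools::file_path_sans_ext(basename(csv_path))
  file.path(out_dir, paste0(file_name, ".parquet"))
}

# Convert a single CSV file to Parquet and return the new path.
convert_to_parquet <- function(csv_path, out_dir) {
  output_path <- parquet_path(csv_path, out_dir)
  data <- arrow::read_csv_arrow(csv_path)
  arrow::write_parquet(data, output_path)
  output_path
}

# Convert every CSV file in a folder, spreading files across workers.
convert_all <- function(in_dir, out_dir, workers = 2) {
  dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)
  csv_files <- list.files(in_dir, pattern = "\\.csv$", full.names = TRUE)
  future::plan(future::multisession, workers = workers)
  furrr::future_map_chr(csv_files, convert_to_parquet, out_dir = out_dir)
}

# Read only the needed columns, then keep a random slice for prototyping.
read_subset <- function(parquet_dir, columns, prototype_prop = 0.1) {
  data <- arrow::open_dataset(parquet_dir) |>
    dplyr::select(dplyr::all_of(columns)) |>
    dplyr::collect()
  # Comment this line out to run on all of the selected data (task 7).
  data <- dplyr::slice_sample(data, prop = prototype_prop)
  data
}
```

For task 2, `parallelly::availableCores()` reports how many cores are
available, so a call such as
`convert_all("data-raw", "data", workers = parallelly::availableCores())`
would use all of them. The folder names in that call are placeholders,
since the actual folders are not named above.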

## Quick "checklist" for a good project

## Expectations for the project

What we expect you to do for the group project:

- Use parallel processing.
- Use a `{targets}` pipeline.
- Use functional programming (including creating your own functions).
- Use an efficient file storage format.
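
A minimal sketch of how these expectations might fit together in a
`_targets.R` file, under stated assumptions: `summarise_column()` and
`project_targets()` are hypothetical names, as are the file path
`data/example.parquet` and the column `value`. None of them come from
the course material.

```r
# Hypothetical helper: descriptive statistics for one numeric column.
summarise_column <- function(data, column) {
  data.frame(
    column = column,
    mean = mean(data[[column]], na.rm = TRUE),
    sd = stats::sd(data[[column]], na.rm = TRUE)
  )
}

# Hypothetical target factory: returns the list of targets in the pipeline.
project_targets <- function() {
  list(
    # Track the Parquet file so that changes to it rerun downstream targets.
    targets::tar_target(parquet_file, "data/example.parquet", format = "file"),
    # Read the data from the efficient storage format.
    targets::tar_target(project_data, arrow::read_parquet(parquet_file)),
    # Create one of the outputs with the helper function.
    targets::tar_target(summary_table, summarise_column(project_data, "value"))
  )
}
```

Ending `_targets.R` with a call to `project_targets()` hands the list of
targets to `targets::tar_make()`. Keeping the statistics in a small
function like `summarise_column()` means it can be tested on the
prototype slice before it is added to the pipeline.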

What we don't expect:

- No complicated analyses.
- No complicated figures or tables.
- No processing that isn't specifically related to creating the
assigned output.

Essentially, the group project is a way to reinforce what you learned
during the course, but in a more relaxed and collaborative setting.
7 changes: 6 additions & 1 deletion preamble/syllabus.qmd
@@ -69,7 +69,12 @@ To help manage expectations and develop the material for this course, we
make a few assumptions about *who you are* as a participant in the
course:

- Assumptions
Assumptions:

- This course builds on the content of the intermediate course
(specifically, using functionals and function-based workflows) and
the advanced course (specifically, using `{targets}` to build
pipelines).

While we have these assumptions to help focus the content of the course,
if you have an interest in learning R but don't fit any of the above