- Set up `config/pipeline_config.json`
- Run `make run`
- Check `output/` for processed files
`config/pipeline_config.json` is the main config file required to run the pipeline.

- Copy the example config to get the structure:

  ```
  cp config/pipeline_config.json.example config/pipeline_config.json
  ```

- Then update the file with the actual configuration values (ask Paul for now; these will be added to the Bitwarden vault when done).
Run commands are based out of the `Makefile` (make sure you have Docker running in the background):

- To run: `make run`
- To run just audio processing: `make run-audio`
- To run just video processing: `make run-video`
- To run both sequentially: `make run-both`
There are times when the script will fail at the "Downloading YouTube" step. To sidestep this, follow these steps:
- Download the audio and video files on your own, and add them to the `cache/` folder. Helper one-liners:
  - My audio download one-liner:

    ```
    yt-dlp -f "bestaudio" -x --audio-format m4a -o "%(upload_date>%Y-%m-%d)s.%(ext)s" "https://youtube.com/live/g0jKUy1hV4w"
    ```

  - My video download one-liner:

    ```
    yt-dlp -f "bestvideo+bestaudio/best" -o "%(upload_date>%Y-%m-%d)s.mp4" --merge-output-format mp4 "https://youtube.com/live/g0jKUy1hV4w"
    ```
- In `config/pipeline_config.json`, flip the `manual_download` boolean to `true`, and add the appropriate paths to the files in the `manual_path` fields.
  - e.g., in `config/pipeline_config.json`:

    ```
    {
      ...
      "manual_download": true,
      ...
      "audio": {
        "manual_path": "cache/audio/2025-02-02.m4a",
        ...
      },
      "video": {
        "manual_path": "cache/video/2025-02-02.mp4",
        ...
      }
    }
    ```
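As a rough illustration of how the pipeline might consult this flag (the function names here are assumptions for the sketch, not the repo's actual API; only the config keys mirror the example above):

```python
import json
from pathlib import Path

def resolve_source(config_path: str, media: str) -> str:
    """Return the file path to use for `media` ('audio' or 'video')."""
    config = json.loads(Path(config_path).read_text())
    if config.get("manual_download"):
        # Manual mode: trust the path the user filled in, but fail loudly
        # if the file was never actually placed in the cache folder.
        manual_path = config[media]["manual_path"]
        if not Path(manual_path).exists():
            raise FileNotFoundError(
                f"manual_download is true but {manual_path} does not exist"
            )
        return manual_path
    # Normal mode: fall through to the regular download step.
    return download(media, config)

def download(media: str, config: dict) -> str:
    raise NotImplementedError  # stand-in for the real downloader
```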
All tests are held in `tests/`, and are run through pytest.

- To run all tests: `make test`
- To run a single test: `make test TEST_FILE=test_filename.py`
- Running `make run` looks in the `Makefile` and runs the `run` command
- `make run` kicks off `docker-compose`, which builds a container based off the `Dockerfile`
- The running container calls the startup script at `scripts/startup.py`
Selecting audio processing kicks off `scripts/run_audio_pipeline.py`, which:

- Validates that the config file is correct
- Creates the `PipeLineData` dataclass, the main object that tracks the paths of the different intermediate steps of the pipeline
- Calls `app/pipelines/audio_pipeline.py`, which is the meat of the processing: `create_audio_pipeline()` composes and returns a pipeline made of discrete "steps" (found in `app/steps/`) that run iteratively to process the piece of content as we need.
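The step-composition pattern described above can be sketched like this (the `PipeLineData` fields and step names here are illustrative stand-ins, not the repo's real ones):

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PipeLineData:
    stream_id: str
    # Tracks the output path of each intermediate step of the pipeline.
    paths: dict = field(default_factory=dict)

# A step takes the shared data object and returns it (possibly updated).
Step = Callable[[PipeLineData], PipeLineData]

def run_pipeline(steps: List[Step], data: PipeLineData) -> PipeLineData:
    # Run each discrete step in order, threading the data object through.
    for step in steps:
        data = step(data)
    return data

# Example steps (hypothetical): each records where it wrote its output.
def download_step(data: PipeLineData) -> PipeLineData:
    data.paths["download"] = f"cache/audio/{data.stream_id}.m4a"
    return data

def transcode_step(data: PipeLineData) -> PipeLineData:
    data.paths["transcode"] = f"cache/audio/{data.stream_id}.transcoded.m4a"
    return data
```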
One problem we had was re-running the script only to download the same files we already had all over again. To solve this, we created `app/downloaders/downloader_proxy.py`: during the download step, before actually downloading a file, we predict what the resulting output filepath will be; if it already exists, we reuse that filepath and skip the download. This saves a lot of time on script re-runs.
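The cache-check idea behind the proxy can be sketched as follows (the function names are assumptions for illustration, not the module's real API):

```python
from pathlib import Path
from typing import Callable

def download_with_cache(predicted_path: str,
                        do_download: Callable[[str], None]) -> str:
    """Skip the download entirely if the predicted output already exists."""
    target = Path(predicted_path)
    if target.exists():
        return str(target)  # cache hit: reuse the existing file
    target.parent.mkdir(parents=True, exist_ok=True)
    do_download(str(target))  # cache miss: actually fetch the file
    return str(target)
```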
`app/cache/` holds the downloaded and intermediary files.
`ffmpeg` is the main file-processing engine. To run it in the different steps, we construct a string array of commands and flags, then call it directly as a system subprocess. In the constructed flags, the following are useful to know:

- `-crf` - Constant Rate Factor; usually set to 16
  - "The range of the quantizer scale is 0-51: where 0 is lossless, 23 is default, and 51 is worst possible. A lower value is a higher quality and a subjectively sane range is 18-28. Consider 18 to be visually lossless or nearly so: it should look the same or nearly the same as the input but it isn't technically lossless."
- `-preset` - usually set to `ultrafast`
  - "These presets affect the encoding speed. Using a slower preset gives you better compression, or quality per filesize, whereas faster presets give you worse compression. In general, you should just use the preset you can afford to wait for. Presets can be ultrafast, superfast, veryfast, faster, fast, medium (default), slow and veryslow."
We've set the defaults to `crf=16` and `preset=ultrafast` because we're OK with the larger resulting filesize, as we're generally just going to upload the output to YouTube and then delete it.
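A minimal sketch of the string-array construction and subprocess call described above (the exact flags per step live in the repo; only the `crf`/`preset` defaults match the ones mentioned):

```python
import subprocess
from typing import List

def build_ffmpeg_cmd(input_path: str, output_path: str,
                     crf: int = 16, preset: str = "ultrafast") -> List[str]:
    """Build an ffmpeg invocation as a list of argument strings."""
    return [
        "ffmpeg",
        "-y",                  # overwrite the output file without prompting
        "-i", input_path,      # input file
        "-crf", str(crf),      # quality: lower = better, 0 = lossless
        "-preset", preset,     # encode speed vs. compression trade-off
        output_path,
    ]

def run_step(cmd: List[str]) -> None:
    # Call ffmpeg directly as a system subprocess; raise on nonzero exit.
    subprocess.run(cmd, check=True)
```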
After the pipeline runs, check the `output/{stream_id}/` directory for the finished file (`stream_id` is whatever you set in the config file).