Henryh/favyen/forest loss 20240917 #67

Merged on Jan 8, 2025 (241 commits)
Commits
2eeb35e
Start implementing prediction and fix metric bug.
uakfdotb Sep 17, 2024
7db79a0
forest loss driver: save best model based on accuracy metric not val …
uakfdotb Sep 18, 2024
bf7df3d
fix prediction website
uakfdotb Sep 18, 2024
38605c0
improve probability display in web app
uakfdotb Sep 19, 2024
c9f44f3
move window name sorting to find_windows_with_outputs so we can reuse it
uakfdotb Sep 19, 2024
53a20a3
add script to create geojson of forest loss events
uakfdotb Sep 19, 2024
4f91c48
Add Planet Labs images to the forest loss driver prediction output.
favyen2 Oct 1, 2024
b86e980
remove rslearn_entrypoint that isn't needed anymore
favyen2 Oct 15, 2024
a13cf9e
Add map visualization and README
uakfdotb Oct 16, 2024
64eff47
Add legend
favyen2 Oct 17, 2024
54449b6
Merge remote-tracking branch 'origin/master' into favyen/forest-loss-…
favyen2 Oct 22, 2024
010feb8
Merge branch 'master' into favyen/forest-loss-20240917
Hgherzog Oct 24, 2024
43f16da
Merge branch 'master' into henryh/favyen/forest-loss-20240917
Hgherzog Oct 24, 2024
b72a2ef
add logging
Hgherzog Oct 24, 2024
eb1adf0
add unit test for event writing
Hgherzog Oct 25, 2024
36396b9
initial refactor of write event for readability
Hgherzog Oct 25, 2024
ed275ac
wip starting to set up tests
Hgherzog Oct 25, 2024
a6c98db
wip unit tests
Hgherzog Oct 29, 2024
3598557
add unit tests for the forest loss predict pipeline
Hgherzog Oct 30, 2024
f33f093
unit test updates
Hgherzog Oct 30, 2024
10f5999
first step of the pipeline now has integration test
Hgherzog Oct 30, 2024
72cf286
notes on fromatting all unit and integration tests pass
Hgherzog Oct 30, 2024
cf7d947
need a better way to handle this env var passing
Hgherzog Oct 30, 2024
5376912
wip
Hgherzog Nov 1, 2024
d342773
Merge branch 'master' into henryh/favyen/forest-loss-20240917
Hgherzog Nov 1, 2024
7082119
refactor and split out parts of the inference pipeline
Hgherzog Nov 1, 2024
56d0cb4
add data to test unmaterialized dataset
Hgherzog Nov 1, 2024
153f626
only save the unmaterialized dataset
Hgherzog Nov 1, 2024
b6dff93
add materialize integration test
Hgherzog Nov 1, 2024
5d73039
make tiles dummy files
Hgherzog Nov 1, 2024
e069c55
add tests for best times output file
Hgherzog Nov 1, 2024
340000b
update_gitignore
Hgherzog Nov 1, 2024
ca16d48
remove accidentlaly uploaded files
Hgherzog Nov 1, 2024
1d38afd
track tiff files with git lfs
Hgherzog Nov 1, 2024
664c61c
add tiles data to lfs storage
Hgherzog Nov 1, 2024
c1788d2
integration tests working for every component
Hgherzog Nov 1, 2024
7b0dee2
clean up predict interface and tests
Hgherzog Nov 2, 2024
a67115e
wip
Hgherzog Nov 2, 2024
bc0b1d2
make a configurable max number of events to extract
Hgherzog Nov 5, 2024
8131020
fix typo
Hgherzog Nov 5, 2024
98f4e8f
add logger to lightning cli
Hgherzog Nov 5, 2024
5cccbf4
full pipeline fix typo error
Hgherzog Nov 5, 2024
59b62fe
end to end tests working
Hgherzog Nov 5, 2024
67e8eef
add env variable for integration test
Hgherzog Nov 5, 2024
e24ed5e
Attempt to fix circular import
Hgherzog Nov 5, 2024
f8e9bb6
make all inputs to the pipeline be given from a single config
Hgherzog Nov 5, 2024
896569a
wip of making the predict pipeline more configurable
Hgherzog Nov 5, 2024
ebdcec3
initial config wip
Hgherzog Nov 5, 2024
408cd79
enable repo relative path for more things
Hgherzog Nov 5, 2024
b46e2e1
wip before getting to launch on beaker
Hgherzog Nov 5, 2024
43db48d
split out parts of the pipeline
Hgherzog Nov 14, 2024
04f5405
enable sequential calling of the pipelines manually
Hgherzog Nov 14, 2024
12e4007
add .github for weekly amazon conservation predicitons
Hgherzog Nov 15, 2024
4ea98c9
add .github for weekly amazon conservation predicitons
Hgherzog Nov 15, 2024
3acc742
simplify names
Hgherzog Nov 15, 2024
e9828ac
add to forest loss driver
Hgherzog Nov 15, 2024
12cdd2f
make default boot disk size smaller
Hgherzog Nov 15, 2024
727ff68
trying on push
Hgherzog Nov 15, 2024
fe3b0be
debugging github actions
Hgherzog Nov 15, 2024
f7f971a
make sure cli works
Hgherzog Nov 15, 2024
9506f08
bump docker image so we don't double download nvidia with rslearn
Hgherzog Nov 18, 2024
ec16593
got predict job to launch on beaker
Hgherzog Nov 19, 2024
d333514
build the wheel for the project in the docker image
Hgherzog Nov 19, 2024
79f3d94
fixed docker image
Hgherzog Nov 19, 2024
d7a1db6
update deploy script
Hgherzog Nov 19, 2024
ecab24a
use relative imports
Hgherzog Nov 19, 2024
4fded88
switch all to relative imports
Hgherzog Nov 19, 2024
5e5ed66
experimenting with image caching
Hgherzog Nov 19, 2024
84eb69d
remove cache stuff
Hgherzog Nov 20, 2024
ee153de
Merge branch 'master' into henryh/favyen/forest-loss-20240917
Hgherzog Nov 20, 2024
917c069
add missing init file
Hgherzog Nov 20, 2024
4554054
remove pycahce
Hgherzog Nov 20, 2024
5b1ab9a
update how we import
Hgherzog Nov 20, 2024
c4c68c1
add missing files
Hgherzog Nov 20, 2024
c7bf247
remove pycache
Hgherzog Nov 20, 2024
1beeb37
add cropped testing data
Hgherzog Nov 20, 2024
6cb59d9
include cropped tifs in docker
Hgherzog Nov 20, 2024
6b56370
cropped
Hgherzog Nov 20, 2024
21b8963
point to crops in bucket
Hgherzog Nov 20, 2024
c4395c1
wip full deploy
Hgherzog Nov 22, 2024
57261b4
add more debug logging to vm startup
Hgherzog Nov 22, 2024
c8a8fc7
fix misconfigured test
Hgherzog Nov 22, 2024
d27967a
fix lint errors
Hgherzog Nov 22, 2024
de2b932
fix lint errors
Hgherzog Nov 22, 2024
5118826
make each image push unique
Hgherzog Nov 22, 2024
2005df4
pull ancd launch script is now working
Hgherzog Nov 22, 2024
93f470e
update vars that need not
Hgherzog Nov 22, 2024
5791a72
properly pass docker image
Hgherzog Nov 22, 2024
81f44a2
debug getting the image name out of the build step
Hgherzog Nov 22, 2024
84f17b5
Additional debugging for ghcr output name
Hgherzog Nov 22, 2024
68a19a6
more debugs
Hgherzog Nov 22, 2024
eb3294a
fix confusing variable naming
Hgherzog Nov 22, 2024
645a2b0
fix beaker file formatting
Hgherzog Nov 22, 2024
055161f
re add quotes
Hgherzog Nov 23, 2024
96626b4
docs
Hgherzog Nov 23, 2024
9b05f7a
improve docs
Hgherzog Nov 23, 2024
5367494
Add shapefile to local data
Hgherzog Nov 23, 2024
697616f
fix more tests
Hgherzog Nov 23, 2024
47147f1
fail with an error in the bash script
Hgherzog Nov 23, 2024
bfe5939
add debugs to figure out which account is in use
Hgherzog Nov 23, 2024
2377758
switch permissions for creating vm
Hgherzog Nov 25, 2024
b356b9a
add additional permissions
Hgherzog Nov 25, 2024
876d86c
make startup script sequential
Hgherzog Nov 25, 2024
25a9d43
regen pat
Hgherzog Nov 25, 2024
3582614
fix secrets
Hgherzog Nov 25, 2024
e7a81b9
move secrets to google secrets manager
Hgherzog Nov 25, 2024
711aabc
ensure planet data works as expected
Hgherzog Nov 27, 2024
467b780
add max exitence time for the instance
Hgherzog Nov 27, 2024
a7a2ee3
trying a run on 1 geotiff with more reasonable config params
Hgherzog Nov 27, 2024
5b3b1c2
remove forest loss driver test_data from local files
Hgherzog Nov 27, 2024
b5cce55
have the root set dynamically per day
Hgherzog Dec 2, 2024
0076bfb
fix required fields and data download for tests
Hgherzog Dec 2, 2024
236f010
remove gitattributes
Hgherzog Dec 2, 2024
3607daa
make the name valid
Hgherzog Dec 2, 2024
74bcf36
use biggerr runner for tests
Hgherzog Dec 2, 2024
053bce8
reduce num workers due to OOM
Hgherzog Dec 2, 2024
cb16b3c
try on bigger machine and remove failing lifetime param
Hgherzog Dec 2, 2024
a14cb57
Trying to speed up shape generation
Hgherzog Dec 2, 2024
0e8c465
add more loggging for debugging
Hgherzog Dec 2, 2024
734adfe
Add more debugging
Hgherzog Dec 2, 2024
fb4110d
check if tqdm is causing the buffer problem
Hgherzog Dec 2, 2024
b3078c1
use prefix directly and give tqdm a min interval
Hgherzog Dec 3, 2024
e6b3081
also run without tqdm
Hgherzog Dec 3, 2024
83100a0
launching job across entire region
Hgherzog Dec 3, 2024
17fb61e
add extra logging and clean up tools
Hgherzog Dec 3, 2024
72a9833
remove visualization code that lives elsewhere
Hgherzog Dec 3, 2024
c1ef849
make sure rslp_prefix is appropriately included
Hgherzog Dec 3, 2024
3e88e87
bump debug version
Hgherzog Dec 3, 2024
dccc23c
try bigger machine also remove a second tqdm
Hgherzog Dec 4, 2024
d68058b
pipe the output logs to a different location to avoid overwhelming th…
Hgherzog Dec 4, 2024
b704eed
try redirecting all the stdout and stderr to a file rather than the t…
Hgherzog Dec 4, 2024
4d23da5
fix typo
Hgherzog Dec 4, 2024
7a58a26
try with fewer workers to see if concurrency caused bug
Hgherzog Dec 5, 2024
732c088
run without visualization layers
Hgherzog Dec 5, 2024
74354be
Merge branch 'master' into henryh/favyen/forest-loss-20240917
Hgherzog Dec 5, 2024
f97b597
get verbose output for failing tests
Hgherzog Dec 5, 2024
81ce5a8
add more debug logging
Hgherzog Dec 5, 2024
a941175
add more debugging
Hgherzog Dec 5, 2024
80b78ae
add more debugging info
Hgherzog Dec 5, 2024
a3a48a9
rerun no vis with num workers bumped and max time increased
Hgherzog Dec 8, 2024
d63d942
try again
Hgherzog Dec 9, 2024
fe3cdbb
debug 1
Hgherzog Dec 9, 2024
a4cd14d
debug 2
Hgherzog Dec 9, 2024
a0680e4
run ingest on second tiff
Hgherzog Dec 9, 2024
81ebbc1
3 geotif debug
Hgherzog Dec 9, 2024
234aef9
debug 4
Hgherzog Dec 9, 2024
11f93f8
add ignore errors and resubmit the job
Hgherzog Dec 10, 2024
c46031c
fix lints
Hgherzog Dec 10, 2024
d25eb42
keep the unmaterialized dataset local and download the rest
Hgherzog Dec 10, 2024
84112b7
remove pycache
Hgherzog Dec 10, 2024
f35a9cc
remove predict while adding tests
Hgherzog Dec 10, 2024
3f6de5c
Still trying to debug
Hgherzog Dec 10, 2024
2c1ecfa
update dockerignore to fix tests
Hgherzog Dec 10, 2024
46827b4
update tests to reflect rslearn changes
Hgherzog Dec 10, 2024
8e71c10
update logging config
Hgherzog Dec 10, 2024
8622210
re enable more tests in ci
Hgherzog Dec 10, 2024
0f7be55
update output probs
Hgherzog Dec 10, 2024
e6cfaa8
tests pass and launch the whole thing
Hgherzog Dec 10, 2024
bae5df9
Remove hard coded image name
Hgherzog Dec 10, 2024
cddc500
add better logging for model outputs
Hgherzog Dec 10, 2024
b3b17d8
fix infrastructure filter unit tests
Hgherzog Dec 10, 2024
67eba6f
fix lint errors
Hgherzog Dec 10, 2024
b92f555
test inference steps only
Hgherzog Dec 10, 2024
dbe4000
fixed test data
Hgherzog Dec 10, 2024
aaed5f5
add Reamd Files to start documenting the forest loss driver project
Hgherzog Dec 10, 2024
07c14b7
add more documentation
Hgherzog Dec 10, 2024
b838379
mark more todos for documenting
Hgherzog Dec 10, 2024
d2a5914
only check model
Hgherzog Dec 10, 2024
5aed7fb
deployment docs
Hgherzog Dec 11, 2024
95dc216
add extra step to model predict
Hgherzog Dec 11, 2024
57facea
run model predict step on gpu
Hgherzog Dec 11, 2024
a019fbb
allow forest loss driver to use nvidia gpu
Hgherzog Dec 11, 2024
d44b4a2
use gpu on action
Hgherzog Dec 11, 2024
8e3be94
try again run on gpu
Hgherzog Dec 11, 2024
1677953
check gpu only for pytest again
Hgherzog Dec 11, 2024
9da5ab0
remove debugging code
Hgherzog Dec 11, 2024
2ee9c8f
decrease tolerance
Hgherzog Dec 11, 2024
c2e2e54
make num workers in tests more dynamix
Hgherzog Dec 11, 2024
991097d
try running predict pipeline again
Hgherzog Dec 11, 2024
97091ee
more eexplicit docker build context
Hgherzog Dec 11, 2024
f5f9d13
more cleanups
Hgherzog Dec 11, 2024
da4d73a
allow all tests to run to see how things go
Hgherzog Dec 11, 2024
15b9d87
fix broken load country path
Hgherzog Dec 11, 2024
fcc4734
add back the gpu enabled runner stuff
Hgherzog Dec 11, 2024
dd98837
reduce number of workers in hopes test don't stall
Hgherzog Dec 11, 2024
26f7989
remove time sleep bug
Hgherzog Dec 11, 2024
a91dc0b
add ignore errors to materialize handler
Hgherzog Dec 12, 2024
2e93385
revert messed up tiff
Hgherzog Dec 12, 2024
a6d6f2d
bigger machine and use more workers
Hgherzog Dec 13, 2024
f935fe1
relaunch forest loss driver with correct date stuff
Hgherzog Dec 16, 2024
13491b4
run ops agent
Hgherzog Dec 17, 2024
50909d6
run ops agent
Hgherzog Dec 17, 2024
fa9a440
add args model and allow all optionality for prepare ingest materialize
Hgherzog Dec 17, 2024
43dacb6
try with abtch size 4
Hgherzog Dec 17, 2024
6f697e1
try with abtch size 4
Hgherzog Dec 17, 2024
c2aa500
new configuration set up for predict pipeline
Hgherzog Dec 18, 2024
575f228
make notes to fix predict pipeline config to use jsonargparse for later
Hgherzog Dec 18, 2024
2b1e392
fix linting errors and use pipeline args for other pipelines
Hgherzog Dec 19, 2024
b6aba91
increase number of workers dramatically
Hgherzog Dec 19, 2024
98cbcfb
get values to load correctly in pipeline, use env vars add some tests
Hgherzog Dec 19, 2024
1312ea1
Add paralelism and workers for many steps
Hgherzog Dec 19, 2024
410ff6e
use get default workers in more palces
Hgherzog Dec 19, 2024
bcdcdbf
move to new dir
Hgherzog Dec 19, 2024
3dee746
readme updates (implement todo)
favyen2 Dec 19, 2024
728b2d4
Merge branch 'henryh/favyen/forest-loss-20240917' of github.com:allen…
favyen2 Dec 19, 2024
6143934
allow relative imports for config
Hgherzog Dec 19, 2024
1563d05
use jsonargparse downstream
Hgherzog Dec 19, 2024
8027055
Merge branch 'henryh/favyen/forest-loss-20240917' of https://github.c…
Hgherzog Dec 19, 2024
d39bde2
remove the from yaml stuff
Hgherzog Dec 19, 2024
2c2209f
fix deploy command
Hgherzog Dec 20, 2024
20541fa
run ci with rslearn fixes
Hgherzog Dec 20, 2024
ba37902
run ci with rslearn fixes
Hgherzog Dec 20, 2024
c718d1d
fix predict laucnh
Hgherzog Dec 20, 2024
6fc9b38
remove temp changes and fix arg passing
Hgherzog Dec 20, 2024
b011caf
update cli
Hgherzog Dec 20, 2024
cceb430
fix deploy argds
Hgherzog Dec 20, 2024
1631a63
close files and rework order
Hgherzog Dec 20, 2024
289a67f
add check
Hgherzog Dec 20, 2024
0fcf5e8
raise error if no events are found
Hgherzog Dec 20, 2024
65d030e
Switch tiffs
Hgherzog Dec 20, 2024
e8cc4ce
add back config
Hgherzog Dec 20, 2024
403c728
address best image name and vm deploy arg
Hgherzog Dec 20, 2024
d341671
address best image name and vm deploy arg
Hgherzog Dec 20, 2024
d0acbea
update to last friday as dataset date
Hgherzog Dec 20, 2024
dba2f10
fix tests
Hgherzog Dec 20, 2024
0b52756
Merge branch 'master' into henryh/favyen/forest-loss-20240917
Hgherzog Dec 20, 2024
a7f2b82
cancel concurrent forest loss dirver jobs
Hgherzog Dec 20, 2024
0d8d6dd
fix lint
Hgherzog Dec 20, 2024
990dd6b
fix yaml and add test to catch bad load
Hgherzog Dec 20, 2024
ee08cf8
better document how cloudiness is determined
Hgherzog Jan 6, 2025
5dd43b0
Use a local index cache to prevent concurrent object update errors
Hgherzog Jan 6, 2025
1f5fbe3
Add comment about new cahce dir
Hgherzog Jan 6, 2025
4dd1b1e
remove unneded ai comment
Hgherzog Jan 6, 2025
bc609b5
fix parsing
Hgherzog Jan 6, 2025
c69139b
log env vars and copy index cahce back
Hgherzog Jan 6, 2025
b6234fd
ready for new run
Hgherzog Jan 7, 2025
ccfbab7
add file system to the index cache so it won't be treated as relative
Hgherzog Jan 7, 2025
07bdedd
adjust setting to last 90 days
Hgherzog Jan 7, 2025
bbda270
update to new folder
Hgherzog Jan 7, 2025
b3d22cf
fix recopying step for forest loss driver
Hgherzog Jan 8, 2025
af43d44
setting schedule to a weekly chron
Hgherzog Jan 8, 2025
1 change: 1 addition & 0 deletions .dockerignore
@@ -4,3 +4,4 @@
.env
lightning_logs
wandb
+ **/test_data/**/**/*.tif
19 changes: 9 additions & 10 deletions .github/workflows/build_test.yaml
@@ -75,19 +75,12 @@ jobs:
echo "ghcr.io Docker image name is ${GHCR_IMAGE}"
echo "ghcr_image_name=\"${GHCR_IMAGE}\"" >> $GITHUB_OUTPUT

# TODO: Make sure skylight can grab the image tag and deploy

test:
- runs-on: ubuntu-latest
+ runs-on: ubuntu-latest-m
steps:
- name: Checkout repository
uses: actions/checkout@v4
- - name: Log in to the Container registry
- uses: docker/login-action@v3
- with:
- registry: ${{ env.REGISTRY }}
- username: ${{ github.actor }}
- password: ${{ secrets.GITHUB_TOKEN }}

- name: Log in to the Container registry
uses: docker/login-action@v3
@@ -113,6 +106,11 @@ jobs:
with:
credentials_json: ${{ secrets.GOOGLE_CREDENTIALS }}

+ - name: Run unit tests with Docker Compose
+ run: |
+ docker compose -f docker-compose.yaml run \
+ test pytest tests/unit/

- name: Run tests with Docker Compose
run: |
docker compose -f docker-compose.yaml run \
@@ -122,7 +120,8 @@ jobs:
-e GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp-credentials.json \
-e RSLP_BUCKET=rslearn-eai \
-e RSLP_PREFIX=gs://rslearn-eai \
- test pytest tests/ --ignore tests/integration_slow/
+ test pytest tests/integration/ --ignore tests/integration_slow/ -vv


- name: Clean up
if: always()
@@ -176,7 +175,7 @@ jobs:
-e GOOGLE_APPLICATION_CREDENTIALS=/tmp/gcp-credentials.json \
-e RSLP_BUCKET=rslearn-eai \
-e RSLP_PREFIX=gs://rslearn-eai \
- rslearn_projects-test pytest tests/integration_slow/
+ rslearn_projects-test pytest tests/integration_slow/ -vv

- name: Clean up
if: always()
328 changes: 328 additions & 0 deletions .github/workflows/deploy_image_on_vm.sh
@@ -0,0 +1,328 @@
#!/bin/bash
cleanup() {
if [ -n "$VM_NAME" ]; then
echo -e "\nCleaning up VM $VM_NAME..."
gcloud compute instances delete "$VM_NAME" --zone="$ZONE" --quiet || true
fi
exit 1
}

# Set up trap for SIGINT (CTRL+C)
trap cleanup SIGINT
# Parse command line arguments
usage() {
echo "Usage: $0 [options]"
echo "Options:"
echo " --project-id GCP project ID (default: skylight-proto-1)"
echo " --zone GCP zone (default: us-west1-b)"
echo " --machine-type VM machine type (default: e2-micro)"
echo " --docker-image Docker image to run"
echo " --command Command to run in container on the vm"
echo " --user User (default: henryh)"
echo " --ghcr-user GitHub Container Registry user (default: allenai)"
echo " --delete Delete VM after completion (yes/no)"
echo " --beaker-token Beaker token"
echo " --beaker-addr Beaker address"
echo " --beaker-username Beaker username associated with the token"
echo " --rslp-project rslp project name (e.g. forest_loss_driver)"
echo " --rslp-prefix rslp prefix"
echo " --workflow workflow name (e.g. predict_pipeline) to run on Beaker"
echo " --gpu-count Number of GPUs to use"
echo " --shared-memory Amount of shared memory"
echo " --cluster Cluster to use"
echo " --priority Priority level"
echo " --task-name Name of the task"
echo " --budget Budget to use"
echo " --workspace Workspace name"
echo " --service-account Service account attached to the VM"
echo " --extra_args_model_predict Extra arguments forwarded to the model predict job launcher"
exit 1
}

# Default values
PROJECT_ID="skylight-proto-1"
ZONE="us-west1-b"
MACHINE_TYPE="e2-micro"
IMAGE_FAMILY="debian-11"
IMAGE_PROJECT="debian-cloud"
USER="henryh"
GHCR_USER="allenai"
DELETE_VM="no"
GPU_COUNT="1"
SHARED_MEMORY="64Gib"
CLUSTER="ai2/jupiter-cirrascale-2"
PRIORITY="normal"
TASK_NAME="forest_loss_driver_inference_$(uuidgen | cut -c1-8)"
BUDGET="ai2/d5"
WORKSPACE="ai2/earth-systems"

# Parse arguments
while [ $# -gt 0 ]; do
case "$1" in
--project-id)
shift
PROJECT_ID="$1"
;;
--zone)
shift
ZONE="$1"
;;
--machine-type)
shift
MACHINE_TYPE="$1"
;;
--docker-image)
shift
DOCKER_IMAGE="$1"
;;
--command)
shift
COMMAND="$1"
;;
--user)
shift
USER="$1"
;;
--ghcr-user)
shift
GHCR_USER="$1"
;;
--delete)
shift
DELETE_VM="$1"
;;
--beaker-token)
shift
BEAKER_TOKEN="$1"
;;
--beaker-addr)
shift
BEAKER_ADDR="$1"
;;
--beaker-username)
shift
BEAKER_USERNAME="$1"
;;
--service-account)
shift
SERVICE_ACCOUNT="$1"
;;
--rslp-project)
shift
RSLP_PROJECT="$1"
;;
--rslp-prefix)
shift
RSLP_PREFIX="$1"
;;
--gpu-count)
shift
GPU_COUNT="$1"
;;
--shared-memory)
shift
SHARED_MEMORY="$1"
;;
--cluster)
shift
CLUSTER="$1"
;;
--priority)
shift
PRIORITY="$1"
;;
--task-name)
shift
TASK_NAME="$1"
;;
--budget)
shift
BUDGET="$1"
;;
--workspace)
shift
WORKSPACE="$1"
;;
--extra_args_model_predict)
shift
EXTRA_ARGS_MODEL_PREDICT="$1"
;;
-h|--help)
usage
;;
*)
echo "Unknown parameter: $1"
usage
;;
esac
shift
done

# Validate required arguments
if [ -z "$DOCKER_IMAGE" ]; then
echo "Error: --docker-image is required"
usage
fi

if [ -z "$COMMAND" ]; then
echo "Error: --command is required"
usage
fi
job_name="forest-loss-driver-inference-$(uuidgen | cut -c1-8)"
# Generate VM name
VM_NAME="rslp-$job_name"

# TODO: add back instance termination action and max run duration
create_vm() {
local vm_name="$1"
local project_id="$2"
local zone="$3"
local machine_type="$4"
local image_family="$5"
local image_project="$6"
local ghcr_user="$7"
local user="$8"
local docker_image="${9}"
local command="${10}"
local beaker_token="${11}"
local beaker_addr="${12}"
local beaker_username="${13}"
local service_account="${14}"
local rslp_project="${15}"
local gpu_count="${16}"
local shared_memory="${17}"
local cluster="${18}"
local priority="${19}"
local task_name="${20}"
local budget="${21}"
local workspace="${22}"
local rslp_prefix="${23}"
local extra_args_model_predict="${24}"
echo "Creating VM $vm_name in project $project_id..." && \
echo "Logged into GCP as $(gcloud config get-value account)" && \
echo "$(gcloud config list)" && \
if ! gcloud compute instances create "$vm_name" \
--project="$project_id" \
--zone="$zone" \
--machine-type="$machine_type" \
--service-account="$service_account" \
--scopes=cloud-platform \
--metadata=\
ops-agents-install='{"name": "ops-agent"}',\
google-logging-enable=TRUE,\
google-monitoring-enable=TRUE,\
enable-osconfig=TRUE,\
ghcr-user="$ghcr_user",\
user="$user",\
docker-image="$docker_image",\
command="$command",\
beaker-token="$beaker_token",\
beaker-addr="$beaker_addr",\
beaker_username="$beaker_username",\
rslp-project="$rslp_project",\
gpu-count="$gpu_count",\
shared-memory="$shared_memory",\
cluster="$cluster",\
priority="$priority",\
task-name="$task_name",\
budget="$budget",\
workspace="$workspace",\
rslp-prefix="$rslp_prefix",\
index-cache-dir="$INDEX_CACHE_DIR",\
tile-store-root-dir="$TILE_STORE_ROOT_DIR",\
extra_args_model_predict="$extra_args_model_predict" \
--metadata-from-file=startup-script=<(echo '#!/bin/bash
# Create a log dir
sudo mkdir -p /var/log/startup-script

# Redirect all output only to the log file to avoid buffer.Scanner token too long errors
exec 1> "/var/log/startup-script/startup.log" 2>&1

echo "Starting startup script at $(date)"

sudo apt-get update && \
sudo apt-get install -y docker.io && \
sudo systemctl start docker && \
export USER=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/user) && \
sudo usermod -aG docker $USER && \
export GHCR_TOKEN=$(gcloud secrets versions access latest --secret="ghcr_pat_forest_loss") && \
export GHCR_USER=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/ghcr-user) && \
export DOCKER_IMAGE=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/docker-image) && \
echo "Logging into GHCR" && \
echo "GHCR_TOKEN is set: ${GHCR_TOKEN:+yes}" && \
echo "GHCR_USER: $GHCR_USER" && \
echo $GHCR_TOKEN | sudo docker login ghcr.io -u $GHCR_USER --password-stdin && \
echo "Pulling Docker image" && \
sudo docker pull $DOCKER_IMAGE && \
echo "Docker image pulled" && \
export PL_API_KEY=$(gcloud secrets versions access latest --secret="planet_api_key_forest_loss") && \
export INDEX_CACHE_DIR=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/index-cache-dir) && \
export TILE_STORE_ROOT_DIR=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/tile-store-root-dir) && \
export LOCAL_INDEX_CACHE_DIR="/tmp/index_cache" && \
mkdir -p $LOCAL_INDEX_CACHE_DIR && \
gsutil -m cp -r $INDEX_CACHE_DIR/* $LOCAL_INDEX_CACHE_DIR/ && \
export COMMAND=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/command) && \
sudo docker run \
-e CLOUDSDK_AUTH_ACCESS_TOKEN=$(gcloud auth application-default print-access-token) \
-e PL_API_KEY=$PL_API_KEY \
-e TILE_STORE_ROOT_DIR=$TILE_STORE_ROOT_DIR \
-e INDEX_CACHE_DIR=file:///index_cache \
-v $LOCAL_INDEX_CACHE_DIR:/index_cache \
$DOCKER_IMAGE /bin/bash -c "$COMMAND" && \
echo "Data Extraction Complete" && \
if ! gsutil -m cp -r $LOCAL_INDEX_CACHE_DIR/* $INDEX_CACHE_DIR/; then
echo "WARNING: Failed to copy index cache back to $INDEX_CACHE_DIR" >&2
else
echo "Successfully copied index cache back to $INDEX_CACHE_DIR"
fi && \
export BEAKER_TOKEN=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/beaker-token) && \
export BEAKER_ADDR=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/beaker-addr) && \
curl -s '\''https://beaker.org/api/v3/release/cli?os=linux&arch=amd64'\'' | sudo tar -zxv -C /usr/local/bin ./beaker && \
export IMAGE_ID=$(docker images --format "{{.ID}}" $DOCKER_IMAGE | head -n 1) && \
export BEAKER_IMAGE_NAME=$(date +%Y%m%d_%H%M%S)_$(echo $DOCKER_IMAGE | tr "/" "_" | tr ":" "_" | tr -cd "[:alnum:]-") && \
export WORKSPACE=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/workspace) && \
beaker config set default_workspace $WORKSPACE && \
echo "Creating Beaker image" && \
beaker image create $IMAGE_ID --name $BEAKER_IMAGE_NAME && \
echo "Image uploaded to Beaker" && \
export BEAKER_USERNAME=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/beaker_username) && \
export GPU_COUNT=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/gpu-count) && \
export SHARED_MEMORY=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/shared-memory) && \
export CLUSTER=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/cluster) && \
export PRIORITY=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/priority) && \
export TASK_NAME=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/task-name) && \
export BUDGET=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/budget) && \
export RSLP_PREFIX=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/rslp-prefix) && \
export RSLP_PROJECT=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/rslp-project) && \
export EXTRA_ARGS=$(curl -H "Metadata-Flavor: Google" http://metadata.google.internal/computeMetadata/v1/instance/attributes/extra_args_model_predict) && \
export INFERENCE_JOB_LAUNCH_COMMAND="python rslp/$RSLP_PROJECT/job_launcher.py \
--project $RSLP_PROJECT \
--workflow predict \
--image $BEAKER_USERNAME/$BEAKER_IMAGE_NAME \
--gpu_count $GPU_COUNT \
--shared_memory $SHARED_MEMORY \
--cluster $CLUSTER \
--priority $PRIORITY \
--task_name $TASK_NAME \
--budget $BUDGET \
--workspace $WORKSPACE \
--extra_args $EXTRA_ARGS" && \
echo "INFERENCE_JOB_LAUNCH_COMMAND: $INFERENCE_JOB_LAUNCH_COMMAND" && \
echo "Launching inference job on Beaker" && \
docker run -e BEAKER_TOKEN=$BEAKER_TOKEN \
-e BEAKER_ADDR=$BEAKER_ADDR \
-e RSLP_PREFIX=$RSLP_PREFIX \
$DOCKER_IMAGE /bin/bash -c "$INFERENCE_JOB_LAUNCH_COMMAND" && \
echo "Model inference launched!"
') \
--image-family="$image_family" \
--image-project="$image_project" \
--boot-disk-size=200GB; then
echo "Failed to create VM instance"
exit 1
fi
echo "Done!"
}

# Create the VM
create_vm "$VM_NAME" "$PROJECT_ID" "$ZONE" "$MACHINE_TYPE" "$IMAGE_FAMILY" "$IMAGE_PROJECT" "$GHCR_USER" "$USER" "$DOCKER_IMAGE" "$COMMAND" "$BEAKER_TOKEN" "$BEAKER_ADDR" "$BEAKER_USERNAME" "$SERVICE_ACCOUNT" "$RSLP_PROJECT" "$GPU_COUNT" "$SHARED_MEMORY" "$CLUSTER" "$PRIORITY" "$TASK_NAME" "$BUDGET" "$WORKSPACE" "$RSLP_PREFIX" "$EXTRA_ARGS_MODEL_PREDICT"

echo "Done!"
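
The startup script in deploy_image_on_vm.sh repeats the same curl call against the GCE metadata server for every instance attribute it reads. A minimal sketch of how that lookup might be factored into one helper is shown here. The names `METADATA_BASE`, `metadata_url`, and `get_metadata` are hypothetical and do not appear in the PR; the URL and the `Metadata-Flavor: Google` header are taken from the script itself.

```shell
#!/bin/bash
# Hypothetical helpers (not part of the PR) for reading GCE instance attributes.
# Base URL copied from the startup script in deploy_image_on_vm.sh.
METADATA_BASE="http://metadata.google.internal/computeMetadata/v1/instance/attributes"

# Build the metadata URL for one attribute key, e.g. "docker-image".
metadata_url() {
    printf '%s/%s' "$METADATA_BASE" "$1"
}

# Fetch one attribute value. The -f flag makes curl return a non-zero status
# on HTTP errors, so a missing attribute fails instead of being stored as an
# error page; -sS silences progress output but still prints errors.
get_metadata() {
    curl -fsS -H "Metadata-Flavor: Google" "$(metadata_url "$1")"
}

# Usage on the VM (assumes the attribute was set with --metadata at creation):
#   export DOCKER_IMAGE=$(get_metadata docker-image)
#   export GHCR_USER=$(get_metadata ghcr-user)
```

With a helper like this, a mistyped attribute key would stop the `&&` chain at the failing lookup rather than passing an empty or garbage value to later steps.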