todo


Now
---
Remove temp dirs with lpjs-remove-me

If dispatchd detects a broken connection with compd:
    dispatchd should cancel all jobs on that node
    * compd should always detect broken sockets at the same time
      and respond as described below

If compd detects a broken connection with dispatchd:
    compd should terminate all jobs on the node,
	unless it was notified about a dispatchd restart
	Send a different signal to chaperone processes, telling them
	not to try to report job completion to dispatchd

dispatchd hangs if pull-command doesn't work
    abalone native rsync doesn't support --mkpath

Fix cancel to work when chaperone is dead
    Make sure cancel never dies as long as node is up
	Should retry indefinitely
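
    One possible shape for the retry loop, as a Python sketch rather than
    LPJS code. retry_cancel, send_cancel, and node_is_up are hypothetical
    names, and the capped exponential backoff is an assumption, not a
    decided policy.

```python
import time


def retry_cancel(send_cancel, node_is_up, initial_delay=1.0, max_delay=60.0,
                 sleep=time.sleep):
    """Call send_cancel() until it succeeds or the node goes down.

    Returns the number of attempts on success, or None if the node went down.
    send_cancel and node_is_up are hypothetical callables returning bool.
    """
    delay = initial_delay
    attempts = 0
    while node_is_up():
        attempts += 1
        try:
            if send_cancel():
                return attempts
        except OSError:
            pass  # treat a socket error as one more failed attempt
        sleep(delay)
        delay = min(delay * 2, max_delay)  # capped exponential backoff
    return None
```

    The cap keeps retries coming at a steady rate for as long as the node
    is up, without flooding compd when it is only slow.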

Deal with running jobs when no completion report is or will be sent
    lpjs_compd: Report back if chaperone process is dead
    Detect down compute nodes and remove running jobs
	This is hard to distinguish from network issues

Update RCUG

Add lpjs.1 man page, referencing all other man pages
    Need work on auto-man2man and subcommands.sh

Blacklist scripts that are canceled due to RSS violations
until mem-per-proc is corrected

Blacklist scripts that allocate much more than the peak RSS use
until mem-per-proc is corrected

Check munge uid of all messages to prevent spoof attacks
    Must match the uid used during compd checkin
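
    The check itself could be as small as the Python sketch below. It
    assumes the uid has already been taken from the decoded munge
    credential; checkin_uids, record_checkin, and uid_matches_checkin are
    hypothetical names, not existing LPJS functions.

```python
# hostname -> uid seen in the credential at compd checkin (hypothetical table)
checkin_uids = {}


def record_checkin(hostname, uid):
    """Remember which uid a compute node checked in with."""
    checkin_uids[hostname] = uid


def uid_matches_checkin(hostname, msg_uid):
    """Accept a message only if its uid matches the one recorded at checkin.

    Unknown hosts are rejected rather than treated as a match.
    """
    return hostname in checkin_uids and checkin_uids[hostname] == msg_uid
```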

When Mac goes down due to full disk access, resources remain allocated
    Same for timeouts?
    Still an issue?

dispatchd hangs while compute node is shutting down
    Make sure all socket reads have a timeout
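
    A minimal Python sketch of a read that cannot block forever. The name
    read_with_timeout and the 5 second default are assumptions for
    illustration only.

```python
import socket


def read_with_timeout(sock, nbytes, timeout_s=5.0):
    """Return up to nbytes from sock, or None if nothing arrives in time.

    An empty bytes result still means the peer closed the connection.
    """
    sock.settimeout(timeout_s)
    try:
        return sock.recv(nbytes)
    except socket.timeout:
        return None  # caller decides whether to retry or mark the node down
```

    Returning None separately from b"" lets the caller tell a silent peer
    apart from a closed connection.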

lpjs cancel doesn't work on jobs that failed and got requeued

Save user-specified node states to disk and reload at startup
Add maintenance state that won't be changed to up at compd checkin
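
    A possible approach to both items, sketched in Python. The JSON file
    format, the state strings, and all function names here are assumptions,
    not the existing dispatchd design.

```python
import json
import os
import tempfile


def save_node_states(path, states):
    """Write {hostname: state} atomically: temp file first, then rename."""
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        json.dump(states, f)
    os.replace(tmp, path)  # readers never see a half-written file


def load_node_states(path):
    """Reload saved states at startup; no file means no saved states."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def state_after_checkin(states, hostname):
    """A compd checkin brings a node up unless it is under maintenance."""
    if states.get(hostname) == "maintenance":
        return "maintenance"
    return "up"
```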

Reset job state and node resources for jobs that fail to start due to OS error

Improve launchd interfacing so munged is properly stopped


Later
-----
Tolerate memory violations if there are no pending jobs

Add hostname resolution check to all admin scripts

Job dependencies

Support real numbers for pmem, useful when using GB or GiB
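
    For example, 1.5GiB could be converted to a whole number of bytes as in
    the Python sketch below. parse_mem and the unit table are assumptions;
    the set of units LPJS accepts may differ.

```python
import re

# Assumed unit table: decimal (KB, MB, GB) and binary (KiB, MiB, GiB) units
UNITS = {
    "KB": 10**3, "MB": 10**6, "GB": 10**9,
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30,
}


def parse_mem(text):
    """Convert a string such as '1.5GiB' to a whole number of bytes."""
    m = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([KMG]i?B)\s*", text)
    if m is None:
        raise ValueError("bad memory specification: " + repr(text))
    return round(float(m.group(1)) * UNITS[m.group(2)])
```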

MPI support

Resource limits via rctl, cgroups, etc?

Submission parameter:
    vmem per proc

Add #lpjs concurrent-job-limit

Optional submission parameters
    #lpjs access submit-dir|path
    submit-dir means node has access to the submit directory
    path means node has access to a file or directory
    #lpjs command command
    Command is any program in the standard PATH on node
	PATH may differ across compute nodes, as they may run different OSes
    Limit jobs to specific set of nodes, e.g. on high-speed network
	has_feature, specified in config file
	By hostname

Convert sbatch scripts to lpjs and vice versa
    #lpjs <-> #SBATCH
    LPJS_ARRAY_INDEX <-> SLURM_ARRAY_TASK_ID
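
    The two substitutions above can be sketched in Python as follows. This
    only swaps the directive marker and the array index variable. The
    options that follow the marker are not translated, so a real converter
    would also need an option mapping that is not specified here. The
    function names are hypothetical.

```python
import re


def sbatch_to_lpjs(script):
    """Swap the directive marker and array index variable, SLURM to LPJS."""
    script = re.sub(r"^#SBATCH\b", "#lpjs", script, flags=re.MULTILINE)
    return re.sub(r"\bSLURM_ARRAY_TASK_ID\b", "LPJS_ARRAY_INDEX", script)


def lpjs_to_sbatch(script):
    """The same two substitutions in the other direction."""
    script = re.sub(r"^#lpjs\b", "#SBATCH", script, flags=re.MULTILINE)
    return re.sub(r"\bLPJS_ARRAY_INDEX\b", "SLURM_ARRAY_TASK_ID", script)
```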

Add optional disk usage requirements to job specs, e.g.
    #lpjs du /tmp 20GiB
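
    A Python sketch of how a node might evaluate such a requirement. The
    directive syntax is the proposal above; parse_du_directive,
    has_free_space, and the unit table are hypothetical.

```python
import shutil

# Assumed unit table for the size field
UNITS = {"MiB": 2**20, "GiB": 2**30, "MB": 10**6, "GB": 10**9}


def parse_du_directive(line):
    """Split '#lpjs du /tmp 20GiB' into (path, required_bytes)."""
    words = line.split()
    if len(words) != 4 or words[:2] != ["#lpjs", "du"]:
        raise ValueError("not a du directive: " + repr(line))
    path, size = words[2], words[3]
    for suffix, factor in UNITS.items():
        if size.endswith(suffix):
            return path, int(size[:-len(suffix)]) * factor
    raise ValueError("unknown size unit in " + repr(size))


def has_free_space(path, required_bytes, disk_usage=shutil.disk_usage):
    """True if the filesystem holding path has enough free bytes."""
    return disk_usage(path).free >= required_bytes
```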