Now
---
Remove temp dirs with lpjs-remove-me
If dispatchd detects a broken connection with compd:
    dispatchd should cancel all jobs on that node
    * compd should always detect broken sockets at the same time
      and respond as described below
If compd detects a broken connection with dispatchd:
    compd should terminate all jobs on the node
        unless it was notified about a dispatchd restart
    send a different signal to chaperone processes, telling them
        not to try to report job completion to dispatchd
dispatchd hangs if pull-command doesn't work
abalone native rsync doesn't support --mkpath
Fix cancel to work when chaperone is dead
Make sure cancel never dies as long as node is up
    Should retry indefinitely
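
A minimal sketch of the "retry indefinitely" idea, not the lpjs implementation. The name retry_forever, the doubling backoff, and the 60-second cap are assumptions made for illustration; the todo only says that cancel should keep trying as long as the node is up.

```python
import time


def retry_forever(action, initial_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call action() until it returns True and report how many tries it took.

    Hypothetical helper: action stands in for "send the cancel request".
    The delay doubles after each failure and is capped at max_delay, so a
    node that stays unreachable is polled at a steady, low rate.
    """
    delay = initial_delay
    attempts = 0
    while True:
        attempts += 1
        try:
            if action():
                return attempts
        except OSError:
            # Connection-level failures count as a failed attempt
            pass
        sleep(delay)
        delay = min(delay * 2, max_delay)
```

The sleep parameter exists only so the loop can be exercised without waiting; a caller would normally leave it at the default.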
Deal with running jobs when no completion report is or will be sent
    lpjs_compd: Report back if chaperone process is dead
    Detect down compute nodes and remove running jobs
        This is hard to distinguish from network issues
Update RCUG
Add lpjs.1 man page, referencing all other man pages
    Needs work on auto-man2man and subcommands.sh
Blacklist scripts that are canceled due to RSS violations
    until mem-per-proc is corrected
Blacklist scripts that allocate much more than the peak RSS use
    until mem-per-proc is corrected
Check munge uid of all messages to prevent spoofing attacks
    Must match the uid used during compd checkin
When Mac goes down due to full disk access, resources remain allocated
    Same for timeouts?
    Still an issue?
dispatchd hangs while compute node is shutting down
    Make sure all socket reads have a timeout
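
One way to bound a socket read, sketched in Python rather than in the daemons' own code. recv_with_timeout is a hypothetical name; the point is only that a read on a dead or stalled peer returns control to the caller instead of hanging.

```python
import socket


def recv_with_timeout(sock, max_bytes, timeout_seconds):
    """Return received bytes, or None if nothing arrives in time.

    An empty bytes object (b"") still means the peer closed the
    connection, which the caller should treat differently from a timeout.
    """
    sock.settimeout(timeout_seconds)
    try:
        return sock.recv(max_bytes)
    except socket.timeout:
        # No data within timeout_seconds; the caller decides what to do next
        return None
```

Distinguishing None (timeout) from b"" (closed) lets the caller retry in the first case and clean up in the second.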
lpjs cancel doesn't work on jobs that failed and got requeued
Save user-specified node states to disk and reload at startup
Add maintenance state that won't be changed to up at compd checkin
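
A rough sketch of the two items above, under stated assumptions: the JSON file format and the function names are invented for illustration, and only the state names "up" and "maintenance" come from this list.

```python
import json
import os


def save_node_states(path, states):
    """Write a {hostname: state} mapping, replacing the old file atomically."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(states, f, indent=2, sort_keys=True)
    # os.replace avoids leaving a half-written file if the daemon dies mid-save
    os.replace(tmp_path, path)


def load_node_states(path):
    """Read the saved mapping; a missing file means no saved states."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def state_after_checkin(saved_state):
    """A compd checkin marks the node up unless it is under maintenance."""
    return saved_state if saved_state == "maintenance" else "up"
```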
Reset job state and node resources for jobs that fail to start due to an OS error
Improve launchd interfacing so munged is properly stopped
Later
-----
Tolerate memory violations if there are no pending jobs
Add hostname resolution check to all admin scripts
Job dependencies
Support real numbers for pmem, useful when using GB or GiB
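
A sketch of what accepting real numbers could look like. parse_mem and the unit table are assumptions: this list only mentions GB and GiB, and treating GB as decimal (10^9) and GiB as binary (2^30) is a choice made here, not something taken from lpjs.

```python
import re

# Assumed unit table: decimal for KB/MB/GB/TB, binary for KiB/MiB/GiB/TiB
_UNITS = {
    "B": 1,
    "KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12,
    "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40,
}


def parse_mem(text):
    """Convert a string such as '1.5GiB' to a whole number of bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*", text)
    if match is None:
        raise ValueError(f"cannot parse memory size: {text!r}")
    number, unit = match.groups()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit: {unit!r}")
    # Round to the nearest byte, since a fraction of a unit is rarely exact
    return round(float(number) * _UNITS[unit])
```

With this table, parse_mem("1.5GiB") gives 1610612736 bytes and parse_mem("0.5GB") gives 500000000.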
MPI support
Resource limits via rctl, cgroups, etc?
Submission parameter:
    vmem per proc
Add #lpjs concurrent-job-limit
Optional submission parameters
    #lpjs access submit-dir|path
        submit-dir means node has access to the submit directory
        path means node has access to a file or directory
    #lpjs command command
        Command is any program in the standard PATH on node
        PATH may differ across compute nodes, as they may run different OSs
    Limit jobs to specific set of nodes, e.g. on high-speed network
        has_feature, specified in config file
        By hostname
Convert sbatch scripts to lpjs and vice versa
    #lpjs <-> #SBATCH
    LPJS_ARRAY_INDEX <-> SLURM_ARRAY_TASK_ID
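
A sketch covering only the two mappings listed above. The function names are hypothetical, and the text after each directive prefix is passed through unchanged; a real converter would also need to translate option names, which this list does not specify.

```python
def _convert(script, old_prefix, new_prefix, old_var, new_var):
    """Swap the directive prefix and the array-index variable name."""
    out = []
    for line in script.splitlines(keepends=True):
        if line.startswith(old_prefix):
            # Only the prefix changes; the rest of the directive is untouched
            line = new_prefix + line[len(old_prefix):]
        out.append(line.replace(old_var, new_var))
    return "".join(out)


def sbatch_to_lpjs(script):
    return _convert(script, "#SBATCH", "#lpjs",
                    "SLURM_ARRAY_TASK_ID", "LPJS_ARRAY_INDEX")


def lpjs_to_sbatch(script):
    return _convert(script, "#lpjs", "#SBATCH",
                    "LPJS_ARRAY_INDEX", "SLURM_ARRAY_TASK_ID")
```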
Add optional disk usage requirements to job specs, e.g.
    #lpjs du /tmp 20GiB
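
A sketch of how such a directive might be read and checked on a node. Everything here is an assumption beyond the example line itself: the function names, the binary-only unit table, and the reading of "du" as "this much free space must be available under the path".

```python
import shutil

# Assumed binary units only, matching the GiB in the example directive
_UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def parse_du_directive(line):
    """Turn '#lpjs du /tmp 20GiB' into ('/tmp', 21474836480)."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "#lpjs" or fields[1] != "du":
        raise ValueError(f"not a du directive: {line!r}")
    path, size = fields[2], fields[3]
    for unit, factor in _UNITS.items():
        if size.endswith(unit):
            return path, round(float(size[:-len(unit)]) * factor)
    raise ValueError(f"unknown size unit in: {size!r}")


def node_satisfies(path, required_bytes, free_bytes=None):
    """Check the requirement against the free space under path."""
    if free_bytes is None:
        # Query the local filesystem when no figure is supplied
        free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes
```

The free_bytes parameter lets a scheduler pass in a figure reported by a remote node instead of querying the local filesystem.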