-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Give users SSH access to builders #6
Conversation
6eff144
to
f7b2d2b
Compare
Obviously, Copr resources aren't unlimited so we need to allocate them | ||
wisely. | ||
|
||
- Only 1 running instance per user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This number should be configurable. I can imagine that for internal Copr we can make this number higher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we allow Packit to provide this to its users (in the future), all the Packit users will share this limit. So, I would have a configurable instance limit and, ideally, a user-level limit as well.
rules, instance type selection, etc., and a button to actually spawn | ||
the instance. The instance would be fresh and it would be up to the | ||
user to reproduce the failure manually. Once the builder spawns, the | ||
page will show instructions on how to connect to it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this one. "Resubmit and keep for ssh"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this one as well. Edit: changed my mind. The consequence is that we can't guarantee the same kind of VM (in AWS or on HW?).
able to specify at least an architecture. But that is IMHO not good | ||
enough. For example we have x86_64 builders our own hypervisors and in | ||
Amazon AWS. Historically, we had issues that appeared only on builders | ||
from one of them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not care about issues that happens only in some pool. It may happen, but I guess this is very rare. Let them choose only architecture. Or to be precise one chroot.
but try it in an anonymous window, it doesn't work. | ||
|
||
Anyway, Copr already fetches some user information from FAS so I think | ||
we can assume we will be able to fetch the SSH keys as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
And if it means that they need to logoff and login to make this feature start working, that it is cheap price.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I requested this against noggin/fasjson: fedora-infra/fasjson#580
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
|
||
- It is not possible to override a status of an already finished build | ||
- Users cannot take a build from the queue (maybe not even theirs) and | ||
process it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nod we will know at the beginnig that this build fail and we will give ssh to user. In this case we have to ignore the result.
to 30 hours at most. That means the lifespan needs to be at | ||
least 30 hours. | ||
- I propose 48 hours because we want to give users enough time for | ||
debugging after the build finishes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would give 1/2 or 1 hour max by default, with the possibility to prolong to 2 days as you propose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for having a really limited lifespan by default. Most of the people won't need much. Half an hour looks fine to me. Not sure if you have a median of the build time or some other metric we can compare this to.
user-ssh-builders/README.md
Outdated
debugging after the build finishes | ||
- Alternatively, we can do a 24-hours lifespan and add an "extend" | ||
button to Copr WebUI or to add a `copr-builder-extend` command | ||
to the instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, here -> but I think 24 hours by default is too much.
instance is already running. We don't want to kill it and spawn a | ||
new one. We probably need to print it was not possible to spawn a | ||
new instance into the builder-live/backend log | ||
- Where to print the instructions on how to connect to the builder? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The backend.log
(backend is responsible for (not)closing the resalloc ticket)?
new instance into the builder-live/backend log | ||
- Where to print the instructions on how to connect to the builder? | ||
- If the build fails in multiple chroots, which one of those builders | ||
should be kept alive? We can't keep them all. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we allow user to have N such machines, the first N will be kept -> the rest of them would just dump to backend log that "You already have N machines ready for debugging, please release them first.".
should be kept alive? We can't keep them all. | ||
- Inability to select an instance type in advance so it may be hard | ||
to debug builds where e.g. Amazon AWS Spot instance is needed | ||
and nothing else. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, we have the exact failing machine in hand, what am I missing?
|
||
Since the workflow starts from a failed build, we could create a MOTD | ||
on the builder showing instructions on how to reproduce the particular | ||
build using `copr-rpmbuild`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I faced many times a "why the heck the build succeeds??" situation, too. Could we have:
--keep-builder-for-debugging=WHEN (failed|succeeded defaults to failed)
|
||
Alternativelly, we could allow users to specify a public URL to their | ||
SSH key, upload it, or copy-paste it to some Copr form, but I don't | ||
like any of these options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. We should also observe possibilities for LDAP for the Red Hat Internal Copr. I think we could make a compromise here if the automatic getting of pub keys is too complicated.
user-ssh-builders/README.md
Outdated
could change the password after logging-in, upload their SSH key, | ||
whatever they want. | ||
|
||
Personally, I prefer `Option 2 - Passwords`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would have to be stored in some private per-user log, or in *_private
database table :-/ I personally prefer Option 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True, I didn't think of that.
Agreed that we should go with Option 1 if it is doable.
👍
|
||
We need to make sure that: | ||
|
||
- It is not possible to override a status of an already finished build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate? This seems to be "for free"... (either we started a new VM to reproduce previous failure, or we get the machine post-build - depending on the chosen approach).
Personally, I prefer `Option 2 - Passwords`. | ||
|
||
|
||
## Security |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure that the resalloc ticket ID is recycled immediately after user finished the debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, actually no - resalloc may re-use the resource with another ticket; the simple solution is to use a randomly generated unique --sandbox
while taking the ticket.
Awesome proposal, thank you @FrostyX moving this forward! |
08dc1de
to
371a433
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks really handy!
As mentioned in today's meeting, it might be worth syncing with the Testing Farm folks since they have a very similar feature.
Obviously, Copr resources aren't unlimited so we need to allocate them | ||
wisely. | ||
|
||
- Only 1 running instance per user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we allow Packit to provide this to its users (in the future), all the Packit users will share this limit. So, I would have a configurable instance limit and, ideally, a user-level limit as well.
to 30 hours at most. That means the lifespan needs to be at | ||
least 30 hours. | ||
- I propose 48 hours because we want to give users enough time for | ||
debugging after the build finishes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for having a really limited lifespan by default. Most of the people won't need much. Half an hour looks fine to me. Not sure if you have a median of the build time or some other metric we can compare this to.
Alternativelly, we could allow users to specify a public URL to their | ||
SSH key, upload it, or copy-paste it to some Copr form, but I don't | ||
like any of these options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Being able to set custom SSH key(s) would allow Packit to provide this to its users. We can easily let either people configure the public key or take it from GitHub. The workflow I see here:
- Someone opens a PR.
- Build is triggered.
- Build fails, in the results (or as a comment), Packit suggests the user to try this feature.
- User triggers this (e.g. via
/packit rerun-with-ssh
or some other comment) - Packit uses the Copr Python client to use this feature and set the user's public SSH key (it should be available via GitHub API).
- Packit adds a comment with the instructions on how to connect.
config file on copr-frontend. | ||
|
||
|
||
## Auth |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if PAM plugin (sssd?) would work to auth using kerberos: we use it both in Fedora and internally.
internal Copr by security initiatives. | ||
|
||
|
||
## Security |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you store some secrets on builders?
See fedora-copr/copr#2364