Timeout stream keep alive for Upgrades, Restores and Migrations #9
Conversation
@dsavineau I just pushed some new changes here. I tested:
CI is succeeding because the status it looks for isn't set, or isn't the right one. The deployment is converging when I test on my OpenShift cluster, so I need to investigate this more.
```yaml
lifecycle:
  postStart:
    exec:
      command: ["/bin/bash", "-c", "mkdir -p /var/lib/pulp/tmp"]
```
I'm not sure I understand why this change is needed; could you explain a little bit, please?
AFAIK that directory should always exist before the api process starts, either:
- from the api entrypoint, which creates it for the file storage backend
- from an emptyDir volume for non-file storage backends
@dsavineau without that change in the api and content deployments, I see errors like these in the api and content pods:
```
Traceback (most recent call last):
  File "/usr/local/bin/pulpcore-manager", line 8, in <module>
    sys.exit(manage())
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/manage.py", line 11, in manage
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/management/commands/add-signing-service.py", line 83, in handle
    SigningService.objects.create(
  File "/usr/local/lib/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/db/models/query.py", line 453, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/models/content.py", line 826, in save
    self.validate()
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/models/content.py", line 858, in validate
    with tempfile.TemporaryDirectory(dir=settings.WORKING_DIRECTORY) as temp_directory_name:
  File "/usr/lib64/python3.8/tempfile.py", line 780, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
  File "/usr/lib64/python3.8/tempfile.py", line 358, in mkdtemp
    _os.mkdir(file, 0o700)
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/pulp/tmp/tmpi0kc8lx1'
```
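The root cause is easy to reproduce outside the cluster: Python's `tempfile` does not create a missing parent directory passed via `dir=`, which is exactly the situation once the PVC mount masks `/var/lib/pulp/tmp`. A minimal sketch (the path is built under a temp dir purely for illustration):

```python
import os
import tempfile

# TemporaryDirectory() requires the directory passed via dir= to already
# exist; otherwise mkdtemp() raises FileNotFoundError, matching the
# traceback above. Build a parent path that is guaranteed not to exist:
base = tempfile.mkdtemp()
missing_parent = os.path.join(base, "pulp", "tmp")  # never created

try:
    tempfile.TemporaryDirectory(dir=missing_parent)
except FileNotFoundError as exc:
    # Same failure mode as the pulpcore-manager traceback
    print("reproduced:", exc.strerror)
```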
I confirmed that the /var/lib/pulp/tmp directory does get created by the Dockerfile, because it is present when running the image locally with podman. However, when deploying with storage_type: File, the content PVC gets mounted at /var/lib/pulp, which masks the pre-created tmp directory.
This led me to look into why the entrypoint script is not pre-creating these directories in the PVC, and it turns out the Pulp team removed that behavior; I'm not sure why. The new galaxy-ng entrypoint doesn't pre-create these directories either.
It looks like they can get away with not having it in the entrypoint because they mount empty dir volumes here - https://github.com/ansible/galaxy_ng/blob/master/dev/docker-compose.yml#L23-L26
So our options are:
- Option A: Create these directories in the PVC with postStart
- Option B: Get a change in to the entrypoint script to pre-create these directories in the PVC
- Option C: volume mount separate empty directories for the sub-directories (:x: defeats the purpose of a pvc-backed storage at /var/lib/pulp)
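For concreteness, Option C would look roughly like the following in the deployment spec (a sketch only; the volume, claim, and container names are illustrative, not taken from the operator):

```yaml
volumes:
  - name: file-storage
    persistentVolumeClaim:
      claimName: galaxy-file-storage   # illustrative claim name
  - name: pulp-tmp
    emptyDir: {}
containers:
  - name: api
    volumeMounts:
      - name: file-storage
        mountPath: /var/lib/pulp
      - name: pulp-tmp
        mountPath: /var/lib/pulp/tmp   # shadows the PVC path, hence the :x:
```

Because the inner mount shadows the PVC path, anything written under tmp would live in the pod-local emptyDir rather than on the PVC, which is why this option defeats the purpose of PVC-backed storage.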
I think we should do Option B. I can remove the postStart approach; it will work downstream as is. It is already broken when using the latest galaxy-minimal images because of the pulp-oci-images entrypoint changes. Context here.
We can work with the galaxy folks to modify the entrypoint script to pre-create the directories. cc @aknochow
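Option B would amount to restoring something like the fragment below near the top of the container entrypoint. This is only a sketch: the actual pulp-oci-images script layout, and which subdirectories beyond tmp need pre-creating, are assumptions here.

```shell
# Sketch of an entrypoint fragment. A real entrypoint would hardcode
# PULP_ROOT to /var/lib/pulp; it falls back to a temp dir here only so
# the sketch can run anywhere without root.
PULP_ROOT="${PULP_ROOT:-$(mktemp -d)}"

# mkdir -p is idempotent, so running this on every container start is
# safe whether or not the PVC already contains the directories.
mkdir -p "${PULP_ROOT}/tmp" "${PULP_ROOT}/media" "${PULP_ROOT}/assets"
```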
I just tested out a basic install, which was successful. I then ran through the complete backup and restore flow.

Backup

First I deployed a fresh galaxy instance and applied the backup object.

Wait for it to reconcile. Post backup statuses:

$ oc get pulpbackup backup1 -o jsonpath={.status} | jq
{
"adminPasswordSecret": "galaxy-admin-password",
"backupClaim": "galaxy-backup-claim",
"backupDirectory": "/backups/openshift-backup-2024-01-26-060258",
"backupNamespace": "galaxy",
"conditions": [
{
"lastTransitionTime": "2024-01-26T06:03:53Z",
"reason": "Successful",
"status": "True",
"type": "BackupComplete"
},
{
"lastTransitionTime": "2024-01-26T06:03:44Z",
"reason": "",
"status": "False",
"type": "Failure"
},
{
"lastTransitionTime": "2024-01-26T06:02:18Z",
"reason": "Successful",
"status": "True",
"type": "Running"
},
{
"lastTransitionTime": "2024-01-26T06:03:53Z",
"reason": "Successful",
"status": "True",
"type": "Successful"
}
],
"containerTokenSecret": "galaxy-container-auth",
"databaseConfigurationSecret": "galaxy-postgres-configuration",
"dbFieldsEncryptionSecret": "galaxy-db-fields-encryption",
"deploymentName": "galaxy",
"deploymentStorageType": "File"
}

Restore

Create a restore object and wait for it to reconcile.

Post restore statuses: No errors were observed in the operator logs, save for the pulp-route 503, which is a known bug that only happens on the first reconciliation loop.

Statuses of new Pulp CR:

$ oc get pulp new-galaxy -o jsonpath={.status} | jq
{
"adminPasswordSecret": "admin-password-secret",
"conditions": [
{
"lastTransitionTime": "2024-01-26T06:11:42+00:00",
"message": "Creating new-galaxy-api-svc Service resource",
"reason": "CreatingService",
"status": "False",
"type": "Galaxy-API-Ready"
},
{
"lastTransitionTime": "2024-01-26T06:11:05+00:00",
"message": "Galaxy operator tasks running",
"reason": "OperatorRunning",
"status": "False",
"type": "Galaxy-Operator-Finished-Execution"
},
{
"lastTransitionTime": "2024-01-26T06:11:18+00:00",
"message": "All Postgres tasks ran successfully",
"reason": "DatabaseTasksFinished",
"status": "True",
"type": "Database-Ready"
},
{
"lastTransitionTime": "2024-01-26T06:11:29+00:00",
"message": "All Galaxy-content tasks ran successfully",
"reason": "ContentTasksFinished",
"status": "True",
"type": "Galaxy-Content-Ready"
},
{
"lastTransitionTime": "2024-01-26T06:10:24+00:00",
"message": "All Galaxy-worker tasks ran successfully",
"reason": "WorkerTasksFinished",
"status": "True",
"type": "Galaxy-Worker-Ready"
},
{
"lastTransitionTime": "2024-01-26T06:08:15+00:00",
"message": "Checking routes",
"reason": "CheckingRoutes",
"status": "True",
"type": "Galaxy-Routes-Ready"
}
],
"containerTokenSecret": "galaxy-container-auth",
"databaseConfigurationSecret": "new-galaxy-postgres-configuration",
"dbFieldsEncryptionSecret": "galaxy-db-fields-encryption",
"deployedImage": "quay.io/pulp/galaxy-minimal:4.7.1",
"storagePersistentVolumeClaim": "new-galaxy-file-storage",
"storageType": "File",
"webURL": "https://new-galaxy-galaxy.apps.aap-dev.ocp4.testing.ansible.com"
}
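Statuses like the two above can also be checked from a script by walking the conditions array instead of eyeballing the jq output. A small sketch (the condition data is a trimmed copy of the backup status above; the helper name is mine):

```python
import json

# Trimmed copy of the backup status shown above; only the fields the
# helper reads are kept.
status = json.loads("""
{
  "conditions": [
    {"type": "BackupComplete", "status": "True", "reason": "Successful"},
    {"type": "Failure", "status": "False", "reason": ""}
  ]
}
""")

def condition_true(status, cond_type):
    """Return True when the named condition is present with status 'True'."""
    return any(
        c["type"] == cond_type and c["status"] == "True"
        for c in status.get("conditions", [])
    )

print(condition_true(status, "BackupComplete"))  # → True
print(condition_true(status, "Failure"))         # → False
```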
- Adds keep-alives for Upgrades, Backups, Restores and Migrations
- Modify how pg password is set in postgres pod (related: ansible/awx-operator#1540)
- Set resolvable host whenever required
- Use pg_restore return code to determine if the task should be marked as failed
- Add booleanSwitch x-descriptor for force_db_drop setting
- Run the pulp-content role before the pulp-api role, so the content PVC is created before the database migrations happen, making it possible to wait for the restore role to finish migrating data if it is already running
- Move signing secret configuration to common role

Signed-off-by: Christian M. Adams <[email protected]>
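The keep-alive idea can be pictured as a wrapper that emits periodic output while a long-running step (such as pg_restore) runs, so streaming clients never see a silent connection long enough to time out, while still preserving the wrapped command's exit code for failure detection. A sketch with a stand-in command (not the operator's actual implementation):

```shell
long_running() { sleep 3; }   # stand-in for pg_restore or a migration

long_running &
pid=$!

# Heartbeat loop: keep the output stream alive while the job runs.
while kill -0 "$pid" 2>/dev/null; do
    echo "still working..."
    sleep 1
done

# wait recovers the real exit code, so a failed restore is still
# detected and the task can be marked as failed.
wait "$pid"
rc=$?
echo "exit code: ${rc}"
```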
SUMMARY
A series of improvements to the restore role to be able to handle very large backups.
Ports of the following awx-operator PRs: