Timeout stream keep alive for Upgrades, Restores and Migrations #9

Merged · rooftopcellist merged 5 commits into ansible:main on Jan 26, 2024

Conversation

Resolved review threads (outdated): playbooks/pulp.yml, roles/backup/tasks/postgres.yml, roles/restore/tasks/postgres.yml
rooftopcellist (Member, Author) commented Jan 17, 2024

@dsavineau I just pushed some new changes here. I tested:

  • Fresh install
  • Backup
  • Restore to a new deployment name with no cleanup
  • Restore from a backup in an otherwise completely clean namespace.

rooftopcellist (Member, Author) commented:

CI is succeeding because a status it is looking for isn't set or isn't correct. The deployment does converge when testing on my OpenShift cluster, but I need to investigate this further.

Comment on lines 145 to 148
lifecycle:
postStart:
exec:
command: ["/bin/bash", "-c", "mkdir -p /var/lib/pulp/tmp"]
dsavineau (Contributor) commented:

I'm not sure I understand why this change is needed. Could you explain a bit, please?

AFAIK that directory should always exist before the api process starts, either:

  • from the api entrypoint, which creates it for the file storage backend
  • from an emptyDir volume for the non-file storage backend

rooftopcellist (Member, Author) replied:

@dsavineau without that change in the api and content deployments, I see errors like these in the api and content pods:

Traceback (most recent call last):
  File "/usr/local/bin/pulpcore-manager", line 8, in <module>
    sys.exit(manage())
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/manage.py", line 11, in manage
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/management/commands/add-signing-service.py", line 83, in handle
    SigningService.objects.create(
  File "/usr/local/lib/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/db/models/query.py", line 453, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/models/content.py", line 826, in save
    self.validate()
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/models/content.py", line 858, in validate
    with tempfile.TemporaryDirectory(dir=settings.WORKING_DIRECTORY) as temp_directory_name:
  File "/usr/lib64/python3.8/tempfile.py", line 780, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
  File "/usr/lib64/python3.8/tempfile.py", line 358, in mkdtemp
    _os.mkdir(file, 0o700)
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/pulp/tmp/tmpi0kc8lx1'

I confirmed that the /var/lib/pulp/tmp directory does get created by the Dockerfile; it is present when running the image locally with podman. However, when deploying with storage_type: File, the content PVC is mounted at /var/lib/pulp, which masks the pre-created tmp directory.
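
For illustration only, here is a minimal sketch of that masking behavior, assuming placeholder container, volume, and claim names (not taken from the operator's actual templates): a PVC mounted at /var/lib/pulp hides whatever the image baked into that path, including /var/lib/pulp/tmp.

# Hypothetical pod spec fragment: the PVC mount at /var/lib/pulp masks the
# image-provided /var/lib/pulp/tmp directory. Names are placeholders.
containers:
  - name: api
    volumeMounts:
      - name: file-storage
        mountPath: /var/lib/pulp          # hides directories baked into the image
volumes:
  - name: file-storage
    persistentVolumeClaim:
      claimName: example-file-storage     # placeholder claim name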

This led me to look into why the entrypoint script is not pre-creating these directories in the PVC; it turns out the Pulp team removed that step, though I'm not sure why.

The new galaxy-ng entrypoint doesn't pre-create these directories either...

It looks like they can get away with not having it in the entrypoint because they mount empty-dir volumes here - https://github.com/ansible/galaxy_ng/blob/master/dev/docker-compose.yml#L23-L26


So our options are:

  • Option A: Create these directories in the PVC with a postStart hook
  • Option B: Get a change into the entrypoint script to pre-create these directories in the PVC
  • Option C: Volume-mount separate emptyDir volumes for the sub-directories (✗ defeats the purpose of PVC-backed storage at /var/lib/pulp; see the sketch after this list)
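
For reference, a rough sketch of what Option C could look like, assuming placeholder names; it mirrors the empty-dir pattern from the galaxy_ng docker-compose file, translated into a pod spec fragment:

# Hypothetical sketch of Option C: an emptyDir layered on top of the PVC mount.
# Anything written to /var/lib/pulp/tmp would then live on node-local emptyDir
# storage rather than the PVC, which is why this option defeats its purpose.
containers:
  - name: api
    volumeMounts:
      - name: file-storage
        mountPath: /var/lib/pulp
      - name: pulp-tmp
        mountPath: /var/lib/pulp/tmp      # emptyDir shadows this path on the PVC
volumes:
  - name: file-storage
    persistentVolumeClaim:
      claimName: example-file-storage     # placeholder
  - name: pulp-tmp
    emptyDir: {}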

I think we should do Option B. I can remove the postStart approach; it will keep working downstream as is, and deployments using the latest galaxy-minimal images are already broken because of the pulp-oci-images entrypoint changes. Context here.

We can work with the galaxy folks to modify the entrypoint script to pre-create the directories. cc @aknochow

rooftopcellist (Member, Author) commented Jan 26, 2024

I just tested out a basic install, which was successful. I then ran through the complete backup and restore flow:

First, I deployed a fresh Galaxy instance called galaxy.

Backup

Apply the following PulpBackup resource:

apiVersion: pulp.pulpproject.org/v1beta1
kind: PulpBackup
metadata:
  name: backup1
  namespace: galaxy
spec:
  no_log: false
  deployment_name: galaxy

Wait for it to reconcile:

Post-backup status
$ oc get pulpbackup backup1 -o jsonpath={.status} | jq
{
  "adminPasswordSecret": "galaxy-admin-password",
  "backupClaim": "galaxy-backup-claim",
  "backupDirectory": "/backups/openshift-backup-2024-01-26-060258",
  "backupNamespace": "galaxy",
  "conditions": [
    {
      "lastTransitionTime": "2024-01-26T06:03:53Z",
      "reason": "Successful",
      "status": "True",
      "type": "BackupComplete"
    },
    {
      "lastTransitionTime": "2024-01-26T06:03:44Z",
      "reason": "",
      "status": "False",
      "type": "Failure"
    },
    {
      "lastTransitionTime": "2024-01-26T06:02:18Z",
      "reason": "Successful",
      "status": "True",
      "type": "Running"
    },
    {
      "lastTransitionTime": "2024-01-26T06:03:53Z",
      "reason": "Successful",
      "status": "True",
      "type": "Successful"
    }
  ],
  "containerTokenSecret": "galaxy-container-auth",
  "databaseConfigurationSecret": "galaxy-postgres-configuration",
  "dbFieldsEncryptionSecret": "galaxy-db-fields-encryption",
  "deploymentName": "galaxy",
  "deploymentStorageType": "File"
}

Restore

Create a PulpRestore object:

apiVersion: pulp.pulpproject.org/v1beta1
kind: PulpRestore
metadata:
  name: restore1
  namespace: galaxy
spec:
  no_log: false
  backup_source: CR
  backup_name: backup1
  deployment_name: new-galaxy

Wait for it to reconcile:

Post-restore status
$ oc get pulprestore restore1 -o jsonpath={.status} | jq
{
  "conditions": [
    {
      "lastTransitionTime": "2024-01-26T06:08:07Z",
      "reason": "Successful",
      "status": "True",
      "type": "RestoreComplete"
    },
    {
      "lastTransitionTime": "2024-01-26T06:07:35Z",
      "reason": "",
      "status": "False",
      "type": "Failure"
    },
    {
      "lastTransitionTime": "2024-01-26T06:04:34Z",
      "reason": "Successful",
      "status": "True",
      "type": "Running"
    },
    {
      "lastTransitionTime": "2024-01-26T06:08:07Z",
      "reason": "Successful",
      "status": "True",
      "type": "Successful"
    }
  ],
  "restoreComplete": true
}

No errors were observed in the operator logs, save for the pulp-route 503, which is a known bug that only occurs on the first reconciliation loop.

Status of the new Pulp CR
$ oc get pulp new-galaxy -o jsonpath={.status} | jq
{
  "adminPasswordSecret": "admin-password-secret",
  "conditions": [
    {
      "lastTransitionTime": "2024-01-26T06:11:42+00:00",
      "message": "Creating new-galaxy-api-svc Service resource",
      "reason": "CreatingService",
      "status": "False",
      "type": "Galaxy-API-Ready"
    },
    {
      "lastTransitionTime": "2024-01-26T06:11:05+00:00",
      "message": "Galaxy operator tasks running",
      "reason": "OperatorRunning",
      "status": "False",
      "type": "Galaxy-Operator-Finished-Execution"
    },
    {
      "lastTransitionTime": "2024-01-26T06:11:18+00:00",
      "message": "All Postgres tasks ran successfully",
      "reason": "DatabaseTasksFinished",
      "status": "True",
      "type": "Database-Ready"
    },
    {
      "lastTransitionTime": "2024-01-26T06:11:29+00:00",
      "message": "All Galaxy-content tasks ran successfully",
      "reason": "ContentTasksFinished",
      "status": "True",
      "type": "Galaxy-Content-Ready"
    },
    {
      "lastTransitionTime": "2024-01-26T06:10:24+00:00",
      "message": "All Galaxy-worker tasks ran successfully",
      "reason": "WorkerTasksFinished",
      "status": "True",
      "type": "Galaxy-Worker-Ready"
    },
    {
      "lastTransitionTime": "2024-01-26T06:08:15+00:00",
      "message": "Checking routes",
      "reason": "CheckingRoutes",
      "status": "True",
      "type": "Galaxy-Routes-Ready"
    }
  ],
  "containerTokenSecret": "galaxy-container-auth",
  "databaseConfigurationSecret": "new-galaxy-postgres-configuration",
  "dbFieldsEncryptionSecret": "galaxy-db-fields-encryption",
  "deployedImage": "quay.io/pulp/galaxy-minimal:4.7.1",
  "storagePersistentVolumeClaim": "new-galaxy-file-storage",
  "storageType": "File",
  "webURL": "https://new-galaxy-galaxy.apps.aap-dev.ocp4.testing.ansible.com"
}

- Adds keep-alives for Upgrades, Backups, Restores and Migrations
- Modify how pg password is set in postgres pod - related: ansible/awx-operator#1540

Signed-off-by: Christian M. Adams <[email protected]>
- set resolvable host every required
- Use pg_restore return code to determine if the task should be marked as failed (see the sketch after this list)
- Add booleanSwitch x-descriptor for force_db_drop setting
- this allows us to run the pulp-content role before the pulp-api role
- the pulp-content role now runs before the pulp-api role
- now the content PVC is created before the database migrations happen,
  making it possible to wait for the restore role to finish migrating
  data if it is already running.
- Move signing secret configuration to common role
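
As context for the pg_restore bullet above, here is a minimal hedged sketch of the return-code pattern; the task name, module arguments, variable names, and paths are illustrative and not copied from the restore role:

# Hypothetical sketch: register the pg_restore result and fail the task based
# on its return code instead of ignoring errors. All names are placeholders.
- name: Restore database dump with pg_restore
  kubernetes.core.k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"   # placeholder variable
    pod: "{{ restore_pod_name }}"             # placeholder variable
    command: "pg_restore --clean --if-exists -d {{ database_name }} /backups/pulp.db"
  register: pg_restore_result
  failed_when: pg_restore_result.rc != 0
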
rooftopcellist merged commit d59b633 into ansible:main on Jan 26, 2024 · 1 check passed