Timeout stream keep alive for Upgrades, Restores and Migrations #9

Merged · rooftopcellist merged 5 commits into ansible:main on Jan 26, 2024

Conversation

Resolved review threads (outdated): playbooks/pulp.yml, roles/backup/tasks/postgres.yml, roles/restore/tasks/postgres.yml
rooftopcellist (Member, Author) commented Jan 17, 2024

@dsavineau I just pushed some new changes here. I tested:

  • Fresh install
  • Backup
  • Restore to a new deployment name with no cleanup
  • Restore from a backup in an otherwise completely clean namespace.

rooftopcellist (Member, Author) commented:

CI is succeeding because a status it is looking for isn't set or isn't correct. The deployment does converge when testing on my OpenShift cluster, but I need to investigate this further.

Comment on lines 145 to 148
lifecycle:
postStart:
exec:
command: ["/bin/bash", "-c", "mkdir -p /var/lib/pulp/tmp"]
dsavineau (Contributor) commented:

I'm not sure I understand why this change is needed. Could you explain a bit, please?

AFAIK that directory should always exist before the api process starts, either:

  • from the api entrypoint, which creates it for the file storage backend
  • from an emptyDir volume for the non-file storage backend

rooftopcellist (Member, Author) replied:

@dsavineau without that change in the api and content deployments, I see errors like these in the api and content pods:

Traceback (most recent call last):
  File "/usr/local/bin/pulpcore-manager", line 8, in <module>
    sys.exit(manage())
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/manage.py", line 11, in manage
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 419, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.8/site-packages/django/core/management/__init__.py", line 413, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 354, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.8/site-packages/django/core/management/base.py", line 398, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/management/commands/add-signing-service.py", line 83, in handle
    SigningService.objects.create(
  File "/usr/local/lib/python3.8/site-packages/django/db/models/manager.py", line 85, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/db/models/query.py", line 453, in create
    obj.save(force_insert=True, using=self.db)
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/models/content.py", line 826, in save
    self.validate()
  File "/usr/local/lib/python3.8/site-packages/pulpcore/app/models/content.py", line 858, in validate
    with tempfile.TemporaryDirectory(dir=settings.WORKING_DIRECTORY) as temp_directory_name:
  File "/usr/lib64/python3.8/tempfile.py", line 780, in __init__
    self.name = mkdtemp(suffix, prefix, dir)
  File "/usr/lib64/python3.8/tempfile.py", line 358, in mkdtemp
    _os.mkdir(file, 0o700)
FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/pulp/tmp/tmpi0kc8lx1'

I confirmed that the /var/lib/pulp/tmp directory does get created by the Dockerfile; it is present when running the image locally with podman. However, when deploying with storage_type: File, the content PVC is mounted at /var/lib/pulp, which masks the pre-created tmp directory.
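
For illustration only, here is a minimal sketch of that masking behavior, assuming placeholder container, volume, and claim names (not taken from the operator's actual templates): a PVC mounted at /var/lib/pulp hides whatever the image baked into that path, including /var/lib/pulp/tmp.

# Hypothetical pod spec fragment: the PVC mount at /var/lib/pulp masks the
# image-provided /var/lib/pulp/tmp directory. Names are placeholders.
containers:
  - name: api
    volumeMounts:
      - name: file-storage
        mountPath: /var/lib/pulp          # hides directories baked into the image
volumes:
  - name: file-storage
    persistentVolumeClaim:
      claimName: example-file-storage     # placeholder claim name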

This led me to look into why the entrypoint script is not pre-creating these directories in the PVC; it turns out the Pulp team removed that step, though I'm not sure why.

The new galaxy-ng entrypoint doesn't pre-create these directories either...

It looks like they can get away with not having it in the entrypoint because they mount empty-dir volumes here - https://github.com/ansible/galaxy_ng/blob/master/dev/docker-compose.yml#L23-L26


So our options are:

  • Option A: Create these directories in the PVC with a postStart hook
  • Option B: Get a change into the entrypoint script to pre-create these directories in the PVC
  • Option C: Volume-mount separate emptyDir volumes for the sub-directories (✗ defeats the purpose of PVC-backed storage at /var/lib/pulp; see the sketch after this list)
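
For reference, a rough sketch of what Option C could look like, assuming placeholder names; it mirrors the empty-dir pattern from the galaxy_ng docker-compose file, translated into a pod spec fragment:

# Hypothetical sketch of Option C: an emptyDir layered on top of the PVC mount.
# Anything written to /var/lib/pulp/tmp would then live on node-local emptyDir
# storage rather than the PVC, which is why this option defeats its purpose.
containers:
  - name: api
    volumeMounts:
      - name: file-storage
        mountPath: /var/lib/pulp
      - name: pulp-tmp
        mountPath: /var/lib/pulp/tmp      # emptyDir shadows this path on the PVC
volumes:
  - name: file-storage
    persistentVolumeClaim:
      claimName: example-file-storage     # placeholder
  - name: pulp-tmp
    emptyDir: {}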

I think we should do Option B. I can remove the postStart approach; it will keep working downstream as is, and deployments using the latest galaxy-minimal images are already broken because of the pulp-oci-images entrypoint changes. Context here.

We can work with the galaxy folks to modify the entrypoint script to pre-create the directories. cc @aknochow

rooftopcellist (Member, Author) commented Jan 26, 2024

I just tested out a basic install, which was successful. I then ran through the complete backup and restore flow:

First, I deployed a fresh Galaxy instance called galaxy.

Backup

Apply the following PulpBackup resource:

apiVersion: pulp.pulpproject.org/v1beta1
kind: PulpBackup
metadata:
  name: backup1
  namespace: galaxy
spec:
  no_log: false
  deployment_name: galaxy

Wait for it to reconcile:

Post-backup status
$ oc get pulpbackup backup1 -o jsonpath={.status} | jq
{
  "adminPasswordSecret": "galaxy-admin-password",
  "backupClaim": "galaxy-backup-claim",
  "backupDirectory": "/backups/openshift-backup-2024-01-26-060258",
  "backupNamespace": "galaxy",
  "conditions": [
    {
      "lastTransitionTime": "2024-01-26T06:03:53Z",
      "reason": "Successful",
      "status": "True",
      "type": "BackupComplete"
    },
    {
      "lastTransitionTime": "2024-01-26T06:03:44Z",
      "reason": "",
      "status": "False",
      "type": "Failure"
    },
    {
      "lastTransitionTime": "2024-01-26T06:02:18Z",
      "reason": "Successful",
      "status": "True",
      "type": "Running"
    },
    {
      "lastTransitionTime": "2024-01-26T06:03:53Z",
      "reason": "Successful",
      "status": "True",
      "type": "Successful"
    }
  ],
  "containerTokenSecret": "galaxy-container-auth",
  "databaseConfigurationSecret": "galaxy-postgres-configuration",
  "dbFieldsEncryptionSecret": "galaxy-db-fields-encryption",
  "deploymentName": "galaxy",
  "deploymentStorageType": "File"
}

Restore

Create a PulpRestore object:

apiVersion: pulp.pulpproject.org/v1beta1
kind: PulpRestore
metadata:
  name: restore1
  namespace: galaxy
spec:
  no_log: false
  backup_source: CR
  backup_name: backup1
  deployment_name: new-galaxy

Wait for it to reconcile:

Post-restore status
$ oc get pulprestore restore1 -o jsonpath={.status} | jq
{
  "conditions": [
    {
      "lastTransitionTime": "2024-01-26T06:08:07Z",
      "reason": "Successful",
      "status": "True",
      "type": "RestoreComplete"
    },
    {
      "lastTransitionTime": "2024-01-26T06:07:35Z",
      "reason": "",
      "status": "False",
      "type": "Failure"
    },
    {
      "lastTransitionTime": "2024-01-26T06:04:34Z",
      "reason": "Successful",
      "status": "True",
      "type": "Running"
    },
    {
      "lastTransitionTime": "2024-01-26T06:08:07Z",
      "reason": "Successful",
      "status": "True",
      "type": "Successful"
    }
  ],
  "restoreComplete": true
}

No errors were observed in the operator logs, save for the pulp-route 503, which is a known bug that only occurs on the first reconciliation loop.

Status of the new Pulp CR
$ oc get pulp new-galaxy -o jsonpath={.status} | jq
{
  "adminPasswordSecret": "admin-password-secret",
  "conditions": [
    {
      "lastTransitionTime": "2024-01-26T06:11:42+00:00",
      "message": "Creating new-galaxy-api-svc Service resource",
      "reason": "CreatingService",
      "status": "False",
      "type": "Galaxy-API-Ready"
    },
    {
      "lastTransitionTime": "2024-01-26T06:11:05+00:00",
      "message": "Galaxy operator tasks running",
      "reason": "OperatorRunning",
      "status": "False",
      "type": "Galaxy-Operator-Finished-Execution"
    },
    {
      "lastTransitionTime": "2024-01-26T06:11:18+00:00",
      "message": "All Postgres tasks ran successfully",
      "reason": "DatabaseTasksFinished",
      "status": "True",
      "type": "Database-Ready"
    },
    {
      "lastTransitionTime": "2024-01-26T06:11:29+00:00",
      "message": "All Galaxy-content tasks ran successfully",
      "reason": "ContentTasksFinished",
      "status": "True",
      "type": "Galaxy-Content-Ready"
    },
    {
      "lastTransitionTime": "2024-01-26T06:10:24+00:00",
      "message": "All Galaxy-worker tasks ran successfully",
      "reason": "WorkerTasksFinished",
      "status": "True",
      "type": "Galaxy-Worker-Ready"
    },
    {
      "lastTransitionTime": "2024-01-26T06:08:15+00:00",
      "message": "Checking routes",
      "reason": "CheckingRoutes",
      "status": "True",
      "type": "Galaxy-Routes-Ready"
    }
  ],
  "containerTokenSecret": "galaxy-container-auth",
  "databaseConfigurationSecret": "new-galaxy-postgres-configuration",
  "dbFieldsEncryptionSecret": "galaxy-db-fields-encryption",
  "deployedImage": "quay.io/pulp/galaxy-minimal:4.7.1",
  "storagePersistentVolumeClaim": "new-galaxy-file-storage",
  "storageType": "File",
  "webURL": "https://new-galaxy-galaxy.apps.aap-dev.ocp4.testing.ansible.com"
}

- Adds keep-alives for Upgrades, Backups, Restores and Migrations
- Modify how pg password is set in postgres pod - related: ansible/awx-operator#1540

Signed-off-by: Christian M. Adams <[email protected]>
- set resolvable host every required
- Use pg_restore return code to determine if the task should be marked as failed (see the sketch after this list)
- Add booleanSwitch x-descriptor for force_db_drop setting
- this allows us to run the pulp-content role before the pulp-api role
- the pulp-content role now runs before the pulp-api role
- now the content PVC is created before the database migrations happen,
  making it possible to wait for the restore role to finish migrating
  data if it is already running.
- Move signing secret configuration to common role
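
As context for the pg_restore bullet above, here is a minimal hedged sketch of the return-code pattern; the task name, module arguments, variable names, and paths are illustrative and not copied from the restore role:

# Hypothetical sketch: register the pg_restore result and fail the task based
# on its return code instead of ignoring errors. All names are placeholders.
- name: Restore database dump with pg_restore
  kubernetes.core.k8s_exec:
    namespace: "{{ backup_pvc_namespace }}"   # placeholder variable
    pod: "{{ restore_pod_name }}"             # placeholder variable
    command: "pg_restore --clean --if-exists -d {{ database_name }} /backups/pulp.db"
  register: pg_restore_result
  failed_when: pg_restore_result.rc != 0
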
rooftopcellist merged commit d59b633 into ansible:main on Jan 26, 2024 · 1 check passed