
Multi pulsar deployment #20

Draft
pauldg wants to merge 1 commit into base: public
Conversation

@pauldg commented Oct 18, 2024

This PR loops over the galaxyproject.pulsar role with variables specific to each deployment.
The shared pulsar variables have been copied over from https://github.com/usegalaxy-eu/vgcn/blob/dev/ansible/group_vars/pulsar.yml
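
To make the looping approach concrete, here is a minimal Ansible sketch of the idea; the play name, host group, and the `pulsar_deployments` list are illustrative assumptions, not the actual contents of this PR (the per-deployment variable names themselves come from the reviewed snippet further down):

```yaml
# Illustrative only: apply galaxyproject.pulsar once per deployment, with
# deployment-specific variables taken from a pulsar_deployments list that is
# assumed to be defined elsewhere (group_vars or extra-vars).
- name: Deploy one Pulsar instance per entry in pulsar_deployments
  hosts: pulsar
  tasks:
    - name: Apply the galaxyproject.pulsar role with deployment-specific vars
      ansible.builtin.include_role:
        name: galaxyproject.pulsar
      vars:
        pulsar_systemd_service_name: "pulsar_{{ item.name }}"
        pulsar_config_dir: "{{ pulsar_root }}/config_{{ item.name }}"
        message_queue_url: "{{ item.message_queue_url }}"
        pulsar_persistence_dir: "/data/share/{{ item.name }}/persisted_data"
      loop: "{{ pulsar_deployments }}"
      loop_control:
        label: "{{ item.name }}"
```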

@bgruening (Member) left a comment
There are so many pulsar.yml files ... can we give them more descriptive names, maybe?

```yaml
- pulsar_systemd_service_name: pulsar_be
  pulsar_config_dir: "{{ pulsar_root }}/config_be"
  message_queue_url: REAL_MESSAGE_QUEUE_URL_BE
  pulsar_persistence_dir: /data/share/be/persisted_data
```
A contributor commented:

Should this be configurable?
I could imagine the playbook iterating over a list of dicts provided by a Terraform variable, if we want to stick to deploying with one single command; similar to this
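
As a rough illustration of that idea (all names and values here are hypothetical), Terraform could render a file, e.g. with `templatefile()`, and pass it to `ansible-playbook -e @pulsar_deployments.yml`:

```yaml
# Hypothetical extra-vars file rendered by Terraform; the playbook loop
# sketched earlier would iterate over pulsar_deployments.
pulsar_deployments:
  - name: be
    message_queue_url: "pyamqp://..."  # placeholder, not a real URL
  - name: it
    message_queue_url: "pyamqp://..."  # placeholder, not a real URL
```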

@mtangaro (Contributor) commented Dec 2, 2024

@gm-ds is working further on this. We need the list of queues as input for the endpoint configuration, along with the "short name" of each endpoint, e.g. it01..., for path configuration.
We can probably parse the queue URL to retrieve the short name. Do you think that would be fine?
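
A possible sketch of that parsing in Ansible; the regex and the variable names are assumptions, not part of this PR:

```yaml
# Hypothetical: derive the endpoint "short name" (e.g. "it03") from the AMQP
# queue URL, assuming the user/vhost follows the galaxy_<short_name> pattern.
- name: Derive endpoint short name from the message queue URL
  ansible.builtin.set_fact:
    endpoint_short_name: "{{ message_queue_url | regex_search('galaxy_([a-z]+[0-9]+)', '\\1') | first }}"
```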

@mira-miracoli (Contributor) commented
Very cool, thank you @pauldg 🚀

@sanjaysrikakulam (Member) commented
Cool, thanks, @pauldg. I assume this deployment style means that the admin/user deploying this should know in advance about the different Pulsars (different Galaxy instances) they plan to deploy, right? What happens when one wants to deploy/set up multiple pulsars in a later stage (after the initial deployment with one)? Will Terraform run the playbook, or will it say there are no infrastructural changes and that it does not execute/deploy the additional pulsar conf?

@mira-miracoli (Contributor) commented
> What happens when one wants to deploy/set up multiple pulsars in a later stage (after the initial deployment with one)? Will Terraform run the playbook, or will it say there are no infrastructural changes and that it does not execute/deploy the additional pulsar conf?

Terraform is not aware of software config, as far as I know.
I would recommend documenting that you tear down and recreate. Maybe we should make sure that Galaxy recovers jobs in these cases.

@sanjaysrikakulam (Member) commented
> What happens when one wants to deploy/set up multiple pulsars in a later stage (after the initial deployment with one)? Will Terraform run the playbook, or will it say there are no infrastructural changes and that it does not execute/deploy the additional pulsar conf?
>
> Terraform is not aware of software config, as far as I know. I would recommend documenting that you tear down and recreate. Maybe we should make sure that Galaxy recovers jobs in these cases.

Tearing down and recreating every time is not a viable or sustainable solution. I would suggest the following simple approach.

  1. Keep the method this PR implements (for the scenarios where the admin/user already knows they want to configure multiple Pulsars).
  2. Extract this also as a playbook (a separate one) and write documentation that would allow admins/users to deploy multiple Pulsars at will without affecting the existing deployment(s).

@mtangaro (Contributor) commented Nov 13, 2024

> 2. Extract this also as a playbook (a separate one) and write documentation that would allow admins/users to deploy multiple Pulsars at will without affecting the existing deployment(s).

Hi @sanjaysrikakulam, I've extracted the playbook and I'm configuring it. I've configured the endpoint to run with "eu" and "it". The strange thing is that if I run a job from .eu, it uses the wrong ID-assignment method, with numeric IDs showing up in the staging dir.

The app.yml is:

```yaml
---
assign_ids: none
conda_auto_init: false
conda_auto_install: false
container_image_cache_path: /data/share/var/database/container_cache
dependency_resolvers_config_file: dependency_resolvers_conf.xml
job_metrics_config_file: /opt/pulsar/config/job_metrics_conf.xml
managers:
  benchmarking:
    submit_universe: vanilla
    type: queued_condor
  production:
    submit_universe: vanilla
    type: queued_condor
    preprocess_action_max_retries: 10
    preprocess_action_interval_start: 2
    preprocess_action_interval_step: 2
    preprocess_action_interval_max: 30
    postprocess_action_max_retries: 10
    postprocess_action_interval_start: 2
    postprocess_action_interval_step: 2
    postprocess_action_interval_max: 30
  test:
    submit_universe: vanilla
    type: queued_condor
message_queue_url: pyamqp://galaxy_it03:[email protected]:5671//pulsar/galaxy_it03?ssl=1
min_polling_interval: 0.5
persistence_directory: /data/share/eu/persisted_data
staging_directory: /data/share/eu/staging
tool_dependency_dir: /data/share/tools/
```

But I get:

```
root@it03-central-manager-usegalaxy-it:/data/share/eu/staging$ ls
344  345  361  75395955  75395967  75396384  75396386  75396528  75396539  75396568
```

Did you experience something like this?

P.S. I think 344, 345 and 361 were very old jobs that EU submitted when I reconfigured the endpoint.

@sanjaysrikakulam (Member) commented Nov 13, 2024


Hey Marco,

I don't totally understand your problem (please clarify if I have misunderstood it), because the IDs seem okay. When Pulsar is not configured with assign_ids=uuid, it will use the IDs assigned by the Galaxy instances. The old ones you have in the staging directory probably come from a different configuration or something in the past. Our (EU) job IDs are currently in the range of 75396568.
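
For context, a minimal app.yml sketch summarising the behaviour described above (treat the comments as an assumption based on this thread, not on the Pulsar docs):

```yaml
# Sketch of the two settings discussed above (assumption based on this thread):
assign_ids: none   # Pulsar keeps the numeric job IDs assigned by Galaxy
# assign_ids: uuid # Pulsar generates its own UUIDs for incoming jobs instead
```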

@mtangaro (Contributor) commented
> I don't totally understand your problem (please clarify if I have misunderstood it), because the IDs seem okay. When Pulsar is not configured with assign_ids=uuid, it will use the IDs assigned by the Galaxy instances. The old ones you have in the staging directory probably come from a different configuration or something in the past. Our (EU) job IDs are currently in the range of 75396568.

The problem is that Pulsar is configured with "assign_ids: none" as usual, but I'm getting IDs anyway, so jobs are failing.

@mtangaro (Contributor) commented
Update (thanks @sanjaysrikakulam).
The endpoint needs proper configuration in the destinations.yml file, as shown here.
I'll update the documentation properly and upload the playbook after cleaning it up a bit.
In principle we can keep EU in the usual path and the national instances in sub-directories.
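
Since the linked example is not reproduced here, a very rough sketch of the idea: on the Galaxy side, each endpoint's destination points its staging at the per-endpoint sub-directory. All destination names, params and paths below are assumptions for illustration, not copied from the linked configuration:

```yaml
# Hypothetical destinations.yml entry (TPV-style); illustrative only.
destinations:
  pulsar_it03:
    runner: pulsar_eu_it03
    params:
      jobs_directory: /data/share/eu/staging   # per-endpoint staging sub-directory
      default_file_action: remote_transfer
```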

@mtangaro (Contributor) commented Feb 3, 2025

Multi pulsar deployment:
