
Multi pulsar deployment #20

Draft
pauldg wants to merge 1 commit into base: public
Conversation

@pauldg commented Oct 18, 2024

This PR loops over the galaxyproject.pulsar role with variables specific to each deployment.
The shared pulsar variables have been copied over from https://github.com/usegalaxy-eu/vgcn/blob/dev/ansible/group_vars/pulsar.yml
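
To make the looping approach concrete, here is a minimal Ansible sketch of the idea; the play name, host group, and the `pulsar_deployments` list are illustrative assumptions, not the actual contents of this PR (the per-deployment variable names themselves come from the reviewed snippet further down):

```yaml
# Illustrative only: apply galaxyproject.pulsar once per deployment, with
# deployment-specific variables taken from a pulsar_deployments list that is
# assumed to be defined elsewhere (group_vars or extra-vars).
- name: Deploy one Pulsar instance per entry in pulsar_deployments
  hosts: pulsar
  tasks:
    - name: Apply the galaxyproject.pulsar role with deployment-specific vars
      ansible.builtin.include_role:
        name: galaxyproject.pulsar
      vars:
        pulsar_systemd_service_name: "pulsar_{{ item.name }}"
        pulsar_config_dir: "{{ pulsar_root }}/config_{{ item.name }}"
        message_queue_url: "{{ item.message_queue_url }}"
        pulsar_persistence_dir: "/data/share/{{ item.name }}/persisted_data"
      loop: "{{ pulsar_deployments }}"
      loop_control:
        label: "{{ item.name }}"
```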

@bgruening (Member) left a comment
There are so many pulsar.yml files ... can we give them more descriptive names, maybe?

```yaml
- pulsar_systemd_service_name: pulsar_be
  pulsar_config_dir: "{{ pulsar_root }}/config_be"
  message_queue_url: REAL_MESSAGE_QUEUE_URL_BE
  pulsar_persistence_dir: /data/share/be/persisted_data
```
A contributor commented:

Should this be configurable?
I could imagine the playbook iterating over a list of dicts provided by a Terraform variable, if we want to stick to deploying with one single command; similar to this
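
As a rough illustration of that idea (all names and values here are hypothetical), Terraform could render a file, e.g. with `templatefile()`, and pass it to `ansible-playbook -e @pulsar_deployments.yml`:

```yaml
# Hypothetical extra-vars file rendered by Terraform; the playbook loop
# sketched earlier would iterate over pulsar_deployments.
pulsar_deployments:
  - name: be
    message_queue_url: "pyamqp://..."  # placeholder, not a real URL
  - name: it
    message_queue_url: "pyamqp://..."  # placeholder, not a real URL
```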

@mtangaro (Contributor) commented Dec 2, 2024

@gm-ds is working further on this. We need the list of queues as input for the endpoint configuration, along with the "short name" of each endpoint, e.g. it01..., for path configuration.
We can probably parse the queue URL to retrieve the short name. Do you think that would be fine?
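
A possible sketch of that parsing in Ansible; the regex and the variable names are assumptions, not part of this PR:

```yaml
# Hypothetical: derive the endpoint "short name" (e.g. "it03") from the AMQP
# queue URL, assuming the user/vhost follows the galaxy_<short_name> pattern.
- name: Derive endpoint short name from the message queue URL
  ansible.builtin.set_fact:
    endpoint_short_name: "{{ message_queue_url | regex_search('galaxy_([a-z]+[0-9]+)', '\\1') | first }}"
```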

@mira-miracoli (Contributor) commented
Very cool, thank you @pauldg 🚀

@sanjaysrikakulam (Member) commented
Cool, thanks, @pauldg. I assume this deployment style means that the admin/user deploying this should know in advance about the different Pulsars (different Galaxy instances) they plan to deploy, right? What happens when one wants to deploy/set up multiple pulsars in a later stage (after the initial deployment with one)? Will Terraform run the playbook, or will it say there are no infrastructural changes and that it does not execute/deploy the additional pulsar conf?

@mira-miracoli (Contributor) commented
> What happens when one wants to deploy/set up multiple pulsars in a later stage (after the initial deployment with one)? Will Terraform run the playbook, or will it say there are no infrastructural changes and that it does not execute/deploy the additional pulsar conf?

Terraform is not aware of software config, as far as I know.
I would recommend documenting that you tear down and recreate. Maybe we should make sure that Galaxy recovers jobs in these cases.

@sanjaysrikakulam (Member) commented
> What happens when one wants to deploy/set up multiple pulsars in a later stage (after the initial deployment with one)? Will Terraform run the playbook, or will it say there are no infrastructural changes and that it does not execute/deploy the additional pulsar conf?
>
> Terraform is not aware of software config, as far as I know. I would recommend documenting that you tear down and recreate. Maybe we should make sure that Galaxy recovers jobs in these cases.

Tearing down and recreating every time is not a viable or sustainable solution. I would suggest the following simple approach.

  1. Keep the method this PR implements (for the scenarios where the admin/user already knows they want to configure multiple Pulsars).
  2. Extract this also as a playbook (a separate one) and write documentation that would allow admins/users to deploy multiple Pulsars at will without affecting the existing deployment(s).

@mtangaro (Contributor) commented Nov 13, 2024

> 2. Extract this also as a playbook (a separate one) and write documentation that would allow admins/users to deploy multiple Pulsars at will without affecting the existing deployment(s).

Hi @sanjaysrikakulam, I've extracted the playbook and I'm configuring it. I've configured the endpoint to run with "eu" and "it". The strange thing is that if I run a job from .eu, it uses the wrong ID-assignment method, with numeric IDs showing up in the staging dir.

The app.yml is:

```yaml
---
assign_ids: none
conda_auto_init: false
conda_auto_install: false
container_image_cache_path: /data/share/var/database/container_cache
dependency_resolvers_config_file: dependency_resolvers_conf.xml
job_metrics_config_file: /opt/pulsar/config/job_metrics_conf.xml
managers:
  benchmarking:
    submit_universe: vanilla
    type: queued_condor
  production:
    submit_universe: vanilla
    type: queued_condor
    preprocess_action_max_retries: 10
    preprocess_action_interval_start: 2
    preprocess_action_interval_step: 2
    preprocess_action_interval_max: 30
    postprocess_action_max_retries: 10
    postprocess_action_interval_start: 2
    postprocess_action_interval_step: 2
    postprocess_action_interval_max: 30
  test:
    submit_universe: vanilla
    type: queued_condor
message_queue_url: pyamqp://galaxy_it03:[email protected]:5671//pulsar/galaxy_it03?ssl=1
min_polling_interval: 0.5
persistence_directory: /data/share/eu/persisted_data
staging_directory: /data/share/eu/staging
tool_dependency_dir: /data/share/tools/
```

But I get:

```
root@it03-central-manager-usegalaxy-it:/data/share/eu/staging$ ls
344  345  361  75395955  75395967  75396384  75396386  75396528  75396539  75396568
```

Did you experience something like this?

P.S. I think 344, 345 and 361 were very old jobs that EU submitted when I reconfigured the endpoint.

@sanjaysrikakulam (Member) commented Nov 13, 2024


Hey Marco,

I don't totally understand your problem (please clarify if I have misunderstood it), because the IDs seem okay. When Pulsar is not configured with assign_ids=uuid, it will use the IDs assigned by the Galaxy instances. The old ones you have in the staging directory probably come from a different configuration or something in the past. Our (EU) job IDs are currently in the range of 75396568.
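
For context, a minimal app.yml sketch summarising the behaviour described above (treat the comments as an assumption based on this thread, not on the Pulsar docs):

```yaml
# Sketch of the two settings discussed above (assumption based on this thread):
assign_ids: none   # Pulsar keeps the numeric job IDs assigned by Galaxy
# assign_ids: uuid # Pulsar generates its own UUIDs for incoming jobs instead
```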

@mtangaro (Contributor) commented
> I don't totally understand your problem (please clarify if I have misunderstood it), because the IDs seem okay. When Pulsar is not configured with assign_ids=uuid, it will use the IDs assigned by the Galaxy instances. The old ones you have in the staging directory probably come from a different configuration or something in the past. Our (EU) job IDs are currently in the range of 75396568.

The problem is that Pulsar is configured with "assign_ids: none" as usual, but I'm getting IDs anyway, so jobs are failing.

@mtangaro (Contributor) commented
Update (thanks @sanjaysrikakulam).
The endpoint needs proper configuration in the destinations.yml file, as shown here.
I'll update the documentation properly and upload the playbook after cleaning it up a bit.
In principle we can keep EU in the usual path and the national instances in sub-directories.
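
Since the linked example is not reproduced here, a very rough sketch of the idea: on the Galaxy side, each endpoint's destination points its staging at the per-endpoint sub-directory. All destination names, params and paths below are assumptions for illustration, not copied from the linked configuration:

```yaml
# Hypothetical destinations.yml entry (TPV-style); illustrative only.
destinations:
  pulsar_it03:
    runner: pulsar_eu_it03
    params:
      jobs_directory: /data/share/eu/staging   # per-endpoint staging sub-directory
      default_file_action: remote_transfer
```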

@mtangaro (Contributor) commented Feb 3, 2025

Multi pulsar deployment:
