AWX Tasks hanging on ansible.utils.cli_parse >= 5.1.0 #382

Open
bewing opened this issue Oct 24, 2024 · 5 comments
@bewing

bewing commented Oct 24, 2024

SUMMARY

AWX Tasks running ansible.utils.cli_parse are hanging

ISSUE TYPE
  • Bug Report
COMPONENT NAME

ansible.utils.cli_parse

ANSIBLE VERSION
ansible [core 2.17.5]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/runner/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.11/site-packages/ansible
  ansible collection location = /runner/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.11.10 (main, Sep  9 2024, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-2)] (/usr/bin/python3.11)
  jinja version = 3.1.4
  libyaml = True

COLLECTION VERSION
# /usr/share/ansible/collections/ansible_collections
Collection    Version
------------- -------
ansible.utils 5.1.0

CONFIGURATION

CONFIG_FILE() = /etc/ansible/ansible.cfg
DISPLAY_SKIPPED_HOSTS(/etc/ansible/ansible.cfg) = False

OS / ENVIRONMENT

awx-operator 2.19.1
AWX 24.6.1

STEPS TO REPRODUCE

We have several extremely large AWX jobs that collect information from the CLI and push new configs to switch devices. After upgrading ansible.utils from 5.0.0 to 5.1.0 and 5.1.2, jobs occasionally hang indefinitely and stop producing any job output. The last log output of every hung job contains "task_action": "ansible.utils.cli_parse"

When we build the same AWX EE image with ansible.utils==5.0.0 instead, no jobs hang.

I will see if I can produce a publishable test case, but it may take some time. I did want to get this recorded in case any other operators are seeing similar issues.

@JCTechSol

JCTechSol commented Dec 20, 2024

Thank you so much for posting this, I thought I was losing my mind. I am experiencing the same behavior with cli_parse when I run it against a target that is down. The task properly times out with "ssh connection failed: ssh connect failed: Timeout connecting to DEVICE_NAME" but then hangs and never moves on. I've even tried adding a task_timeout, which triggers, but the run still hangs on the task.

I did what you suggested and downgraded the collection and had no problems.
I tried 4.0.0 with both ansible 2.17.4 and ansible 2.18.0 (Python 3.12.7), and both behave the same: it works as expected on 4.0.0 but not on 5.1.2.
In my case I am not using AWX.

It seems to work fine if the list of hosts is small (<10), but if it's >20 or so, 5.1.2 seems to choke. It doesn't even need many unreachable hosts to choke; a single host in the batch that is down is enough to make it hang.

@JCTechSol

5.0.0 doesn't seem to work at all; it doesn't appear to actually make an SSH connection to the target. I think that is a separate bug in 5.0.0, outside of this issue.

@JCTechSol

Here is some more detailed testing:
Testing procedure:
inventory of 43 network devices, 1 being offline

        - name: Check Radius Server Bindings
          ansible.utils.cli_parse:
            command: "show radius"
            parser:
              name: ansible.utils.ttp
            set_fact: __radius_status
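
For anyone trying to reproduce this, the task above could be wrapped in a play along these lines. This is only a sketch: the group name `switches`, the play name, and `gather_facts: false` are assumptions and not part of the original test setup, and the connection settings (for example `ansible_connection` and `ansible_network_os`) are assumed to come from the inventory.

```yaml
# Hypothetical wrapper play for the task above.
# "switches" is an assumed inventory group; the report used an
# inventory of 43 network devices with 1 of them offline.
- name: Reproduce cli_parse hang with one unreachable host
  hosts: switches
  gather_facts: false   # assumption: fact gathering is not needed here
  tasks:
    - name: Check Radius Server Bindings
      ansible.utils.cli_parse:
        command: "show radius"
        parser:
          name: ansible.utils.ttp
        set_fact: __radius_status
```

The hang reported in this thread also depended on the host count and on `[diff] always = yes` being set, so a smaller inventory or a different ansible.cfg may not trigger it.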

ansible.utils:
3.0.0 - This works as expected

The following versions suffer from the removal of the persistent connection (8dc11de), which was fixed in 5.1.0 (31c8097):

3.1.0 - Tested with 8dc11de reverted, works as expected
4.0.0 - Tested with 8dc11de reverted, works as expected
4.1.0 - Tested with 8dc11de reverted, does not work, hangs

The troubling thing is that I can't identify which change between 4.0.0 and 4.1.0 is responsible for this changed behavior...

@JCTechSol

I've tested on ansible-core 2.18.1 and there is no change.

But I did narrow it down a little more:
I have this in my ansible.cfg

[diff]
always = yes

If I comment this line out, it doesn't hang on any version.
I'm not sure why. I don't see anything between 4.0.0 and 4.1.0 that changed with fact_diff, and I'm also not sure why it's being called with cli_parse.
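
For testing this without editing ansible.cfg, one option might be the standard `diff` playbook keyword, which can turn diff mode off for a single task. This is an untested sketch: whether a task-level override avoids the hang in the same way as commenting out the config line is an assumption.

```yaml
# Same task as in the earlier test, with diff mode disabled for this
# task only. Untested assumption: this behaves like removing
# "[diff] always = yes" as far as the hang is concerned.
- name: Check Radius Server Bindings
  ansible.utils.cli_parse:
    command: "show radius"
    parser:
      name: ansible.utils.ttp
    set_fact: __radius_status
  diff: false
```

If the hang disappears with this override while the config line is still set, that would point more strongly at diff handling than at the connection code.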

@Ruchip16
Contributor

Ruchip16 commented Jan 9, 2025

Hi @bewing and @JCTechSol,
Thank you for reporting this issue. I’ve been able to reproduce the behavior where tasks using ansible.utils.cli_parse hang indefinitely after encountering an SSH connection timeout (e.g., ssh connect failed: Timeout connecting to DEVICE_NAME).

To investigate further and pinpoint the root cause, I would need the following additional details:

Debugging Logs:
Could you run the playbook with increased verbosity (e.g., -vvv) and share the debug logs leading up to the hang?
Any AWX logs or system-level logs (if applicable) could also help narrow down the issue.

Simplified Test Case:
If possible, sharing a minimal inventory and playbook that reproduces the issue would be helpful to validate against different setups.
I’m open to contributions and would encourage anyone facing this issue to collaborate on identifying a fix. If you’ve already started investigating or have ideas for potential solutions, feel free to share them or submit a PR to the repository.
