Pilots run forever #7912

Open
chrisburr opened this issue Nov 28, 2024 · 1 comment

@chrisburr
Member

In LHCb we're seeing a significant minority of pilots that never stop, even after the job has finished. Inspecting one of them, I see the JobAgent is stuck in this loop in finalize:

time.sleep(int(self.am_getOption("PollingTime")))

Process 3512260: /cvmfs/lhcb.cern.ch/lhcbdirac/versions/v11.0.51-1730714484/Linux-x86_64/bin/python3.11 /cvmfs/lhcb.cern.ch/lhcbdirac/versions/v11.0.51-1730714484/Linux-x86_64/bin/dirac-agent WorkloadManagement/JobAgent -o MaxCycles=1 -o PollingTime=20 -o StopOnApplicationFailure=True -o StopAfterFailedMatches=10 -o LogLevel=DEBUG -s /Resources/Computing/CEDefaults -o WorkingDirectory=/localdisk1/dirac/work/tmp.2rDABiQEG1 -o /LocalSite/CPUTime=4890240 -o /DIRAC/Security/UseServerCertificate=yes -o /LocalSite/InstancePath=/localdisk1/dirac/work/tmp.2rDABiQEG1 -o /AgentJobRequirements/ExtraOptions=pilot.cfg --cfg pilot.cfg
Python v3.11.10 (/cvmfs/lhcb.cern.ch/lhcbdirac/versions/v11.0.51-1730714484/Linux-x86_64/bin/python3.11)

Thread 3512260 (idle): "MainThread"
    finalize (DIRAC/WorkloadManagementSystem/Agent/JobAgent.py:898)
        Arguments:
            self: <JobAgent at 0x7f832241ec50>
        Locals:
            res: {"OK": True, "Value": {}}
            result: {"OK": True, "Value": ([], [])}
    __finalize (DIRAC/Core/Base/AgentReactor.py:130)
        Arguments:
            self: <AgentReactor at 0x7f83249da550>
        Locals:
            agentName: "WorkloadManagement/JobAgent"
    go (DIRAC/Core/Base/AgentReactor.py:151)
        Arguments:
            self: <AgentReactor at 0x7f83249da550>
        Locals:
            timeToNext: None
    main (DIRAC/Core/scripts/dirac_agent.py:39)
        Locals:
            positionalArgs: ["WorkloadManagement/JobAgent"]
            localCfg: <LocalConfiguration at 0x7f8322e6fd50>
            agentName: "WorkloadManagement/JobAgent"
            resultDict: {"OK": True, "Value": None}
            agentReactor: <AgentReactor at 0x7f83249da550>
            result: {"OK": True, "Value": None}
    __call__ (DIRAC/Core/Base/Script.py:74)
        Arguments:
            self: <cell at 0x7f8322e6a830>
            func: None
        Locals:
            matches: [<EntryPoint at 0x7f8322cca610>]
            entrypoint: <EntryPoint at 0x7f8322cca610>
            entrypointFunc: <Script at 0x7f8323bf0290>
    <module> (dirac-agent:8)
Thread 3512261 (idle): "Thread-2 (__executorThread)"
    __executorThread (DIRAC/Core/Utilities/ThreadScheduler.py:115)
        Arguments:
            self: <ThreadScheduler at 0x7f8323bf1d10>
        Locals:
            timeToNext: 16.985708951950073
    run (threading.py:982)
        Arguments:
            self: <Thread at 0x7f8322ce5010>
    _bootstrap_inner (threading.py:1045)
        Arguments:
            self: <Thread at 0x7f8322ce5010>
    _bootstrap (threading.py:1002)
        Arguments:
            self: <Thread at 0x7f8322ce5010>
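
For illustration only, here is a minimal sketch of how a polling loop of this kind can be given an upper bound so it cannot sleep forever. It is not the DIRAC implementation: `wait_until_drained`, `get_pending`, `polling_time` and `max_wait` are hypothetical names, and the assumption is that the loop waits for some collection of outstanding work to become empty.

```python
import time


def wait_until_drained(get_pending, polling_time, max_wait,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll get_pending() until it returns an empty collection.

    Hypothetical helper, not part of DIRAC. Returns True once nothing is
    pending, or False if max_wait seconds pass first.
    """
    deadline = clock() + max_wait
    while get_pending():
        remaining = deadline - clock()
        if remaining <= 0:
            # Give up instead of sleeping again; the caller decides what to do.
            return False
        # Never sleep past the deadline.
        sleep(min(polling_time, remaining))
    return True


if __name__ == "__main__":
    # Usage example: something that never drains, bounded to about 2 seconds.
    drained = wait_until_drained(lambda: ["still-running"], polling_time=1, max_wait=2)
    print("drained" if drained else "timed out")
```

Whether a timeout is the right fix depends on why the exit condition is never met, which the stack dump alone does not show.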
@fstagni
Contributor

fstagni commented Nov 29, 2024

A log of one of these pilots would help.
