[spike] solver failing due to OOM #2690

Open
3 tasks
harshad16 opened this issue Oct 27, 2022 · 4 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/devsecops Categorizes an issue or PR as relevant to SIG DevSecOps.

Comments

@harshad16
Member

harshad16 commented Oct 27, 2022

Describe the bug

The solver resolves packages from a given index_url (for example, pypi/simple) and then resolves their dependencies.
Because the dependencies of many packages on one index_url can be found on a different index_url,
the current logic passes all the dependency index URLs to the resolution.

Solvers that run with all the dependency index URLs fail to execute and run into OOM.

Screenshot from 2022-10-27 15-19-06
Screenshot from 2022-10-27 15-18-25

To Reproduce
Steps to reproduce the behavior:

  1. Go to thoth-middlertier-stage namespace in cluster
  2. Click on solver pods
  3. See error

Expected behavior
successful execution of solvers.

Acceptance criteria

  • Explore the OOM failure in detail.
  • Explain the OOM issue in detail in a report for future reference.
  • Check whether optimizing the dependency indexes fixes the issue.
@harshad16 harshad16 added the kind/bug Categorizes issue or PR as related to a bug. label Oct 27, 2022
@harshad16
Member Author

/priority important-soon
/sig devsecops

@sesheta sesheta added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/devsecops Categorizes an issue or PR as relevant to SIG DevSecOps. labels Oct 27, 2022
@harshad16 harshad16 moved this to 🆕 New in Planning Board Nov 3, 2022
@harshad16 harshad16 changed the title solver failing due to OOM [spike] solver failing due to OOM Nov 3, 2022
@harshad16 harshad16 moved this from 🆕 New to 🔖 Next in Planning Board Nov 3, 2022
@harshad16 harshad16 self-assigned this Nov 3, 2022
@harshad16
Member Author

harshad16 commented Nov 29, 2022

Report on the diagnosis of the OOM failures happening in solvers:

Solver execution extracts various information from a package and its dependencies.
One aspect is checking whether the package is installable, which is done via this function:
https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L93

The installation is checked by installing the package via pip in the virtualenv.
The command to install the package via pip is generated and executed via thoth-analyzer.
https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L110
https://github.com/thoth-station/analyzer/blob/ad12a1ed76ff6aa1606dae3efb47e3bb8d5af61f/thoth/analyzer/command.py#L99

  1. The constructed command has a quoting issue.
    Generated in debug mode:
    cmd: "Running command 'venv/bin/python3 -m pip install --force-reinstall --no-cache-dir --no-deps torch===1.12.1+cu113 --index-url \"https://download.pytorch.org/whl/cu113\" --trusted-host download.pytorch.org'"

This causes the command execution to fail:

WARNING: The index url ""https://download.pytorch.org/whl/cu113"" seems invalid, please provide a scheme.
Looking in indexes: "https://download.pytorch.org/whl/cu113"
WARNING: Location '"https://download.pytorch.org/whl/cu113"/torch/' is ignored: it is either a non-existing path or lacks a specific scheme.
ERROR: Could not find a version that satisfies the requirement torch===1.12.1+cu113 (from versions: none)
ERROR: No matching distribution found for torch===1.12.1+cu113
  2. If the command is invalid, the process fails and the exception is caught.
    If the command is valid, however, it may never complete, and the delegator keeps waiting.
    For example, when the solver is executed for the package roundup===2.1.0:
    cmd: "python3 -m pip install --force-reinstall --no-cache-dir --no-deps roundup===2.1.0 --index-url https://pypi.org/simple --trusted-host pypi.org"
    When run in debug mode, this command would not finish in time, which would cause the solver execution to be dropped in the cluster due to the timeout.

  3. Some solvers are able to execute the installation but handle a large package such as torch.
    The extraction of the _hashes for that package's artifacts consumes all the CPU.
    Screenshot from 2022-11-29 14-43-04
    Execution of this function consumes all the allotted CPU, i.e. 100m:
    https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L229
    CPU throttling might not be the reason for the OOM kill; this is just one aspect found.
    It seems the extraction of artifacts for hashes is somehow causing a memory leak.
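
The quoting failure from the first point can be reproduced without pip. The sketch below is only an illustration: it assumes the command string ends up split into arguments without a shell removing the quotes, which has not been confirmed for thoth-analyzer. `build_pip_command` is a hypothetical helper, not a function from solver or thoth-analyzer.

```python
import shlex
from typing import List


def build_pip_command(package: str, version: str, index_url: str, trusted_host: str) -> List[str]:
    """Hypothetical helper: build the pip invocation as an argument list.

    Passing a list means no quoting is needed around the index URL at all.
    """
    return [
        "python3", "-m", "pip", "install",
        "--force-reinstall", "--no-cache-dir", "--no-deps",
        f"{package}==={version}",
        "--index-url", index_url,
        "--trusted-host", trusted_host,
    ]


quoted = 'pip install torch===1.12.1+cu113 --index-url "https://download.pytorch.org/whl/cu113"'

# Plain whitespace splitting keeps the literal quotes as part of the URL,
# which matches the "seems invalid, please provide a scheme" warning above.
naive_args = quoted.split()
print(naive_args[-1])

# shlex.split applies POSIX shell quoting rules and strips the quotes.
shell_like_args = shlex.split(quoted)
print(shell_like_args[-1])

# Building the arguments as a list avoids the question entirely.
args = build_pip_command(
    "torch", "1.12.1+cu113",
    "https://download.pytorch.org/whl/cu113", "download.pytorch.org",
)
print(args)
```

If the command has to stay a single string, wrapping each argument with `shlex.quote` and splitting it again with `shlex.split` would be another option to evaluate.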

Solving the points above might fix the failing solver executions
and provide more information on the solvers that fail with OOM.
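
For the installation that never completes, one option to evaluate is a hard timeout on the install command, so that the solver fails fast instead of waiting until the cluster drops it. The following is a minimal sketch using the standard library `subprocess` module in place of the delegator; `run_with_timeout` and the timeout values are assumptions for illustration, not existing solver code.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(args: List[str], timeout_seconds: float) -> Optional[subprocess.CompletedProcess]:
    """Run a command and return None if it does not finish in time.

    subprocess.run kills the child process when the timeout expires
    and then raises TimeoutExpired.
    """
    try:
        return subprocess.run(args, capture_output=True, text=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None


# Stand-in for a pip install that hangs: a child process that just sleeps.
hung = run_with_timeout([sys.executable, "-c", "import time; time.sleep(30)"], timeout_seconds=1)

# Stand-in for an install that completes normally.
done = run_with_timeout([sys.executable, "-c", "print('ok')"], timeout_seconds=30)

print("hung command timed out:", hung is None)
print("finished command return code:", done.returncode)
```

A `None` result could then be recorded as an installation failure for that package, in the same way an invalid command is handled today.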

@harshad16
Member Author

As speculated above:
The function fill_hashes https://github.com/thoth-station/solver/blob/1992d58432f668b3bc1b131ba0a6a75f8254a50d/thoth/solver/python/python.py#L229
gathers hashes from the artifacts: https://github.com/thoth-station/python/blob/a8aba6cd9063710335e4e3d4a8f7823f7951a498/thoth/python/source.py#L441
which are downloaded to temporary files: https://github.com/thoth-station/python/blob/a8aba6cd9063710335e4e3d4a8f7823f7951a498/thoth/python/artifact.py#L59

Our current memory limit is 768Mi, and this memory is consumed by the size of the download.
I will verify this by reproducing it.
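
If the download size does turn out to exhaust the limit, one direction to try is hashing artifacts in fixed-size chunks, so that memory use stays bounded by the chunk size rather than the artifact size. This is a sketch under that assumption and not how thoth-python currently works; `sha256_of_stream` and `CHUNK_SIZE` are hypothetical names.

```python
import hashlib
import io
from typing import BinaryIO

CHUNK_SIZE = 1024 * 1024  # 1 MiB per read; an arbitrary illustrative value


def sha256_of_stream(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> str:
    """Hash a binary stream while holding at most one chunk in memory."""
    digest = hashlib.sha256()
    # iter() with a sentinel keeps calling read() until it returns b"".
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a large artifact; a real one would be an open file or HTTP response body.
payload = b"x" * (3 * CHUNK_SIZE + 17)
streamed = sha256_of_stream(io.BytesIO(payload))
print(streamed)
```

This would not help if the temporary directory itself is memory-backed in the solver pod, which this thread does not establish and is worth checking during the reproduction.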

@harshad16
Member Author

suggestion:

  • Verify if we can block packages from solvers.
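
One possible shape for such a block is a name check before a solver run is scheduled. The sketch below is purely illustrative: `BLOCKED_PACKAGES`, `normalize_name`, and `is_blocked` are hypothetical and do not exist in solver, and listing torch is only an example taken from this issue. Names are compared after PEP 503 normalization so that `Torch` and `torch` match.

```python
import re
from typing import Iterable

# Hypothetical block list; torch is used only because it appears in this issue.
BLOCKED_PACKAGES = {"torch"}


def normalize_name(name: str) -> str:
    """Normalize a project name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()


def is_blocked(name: str, blocked: Iterable[str] = BLOCKED_PACKAGES) -> bool:
    """Return True if the package should be skipped by the solver."""
    return normalize_name(name) in {normalize_name(entry) for entry in blocked}


for candidate in ("Torch", "roundup"):
    print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```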

Projects
Status: 🏗 In progress
Development

No branches or pull requests

2 participants