
Sync some features with anaconda's recipe #318

Merged: 54 commits into conda-forge:main, Jan 30, 2025

Conversation

@danpetry (Contributor) commented Jan 14, 2025

Checklist

  • Used a personal fork of the feedstock to propose changes
  • Bumped the build number (if the version is unchanged)
  • Reset the build number to 0 (if the version changed)
  • Re-rendered with the latest conda-smithy (Use the phrase @conda-forge-admin, please rerender in a comment in this PR for automated rerendering)
  • Ensured the license file is being packaged.

See commit messages for individual changes.
If there are any changes that aren't wanted, it'd be nice to know before debugging them.
There are some more involved changes on our side; they will follow in further PRs if, after investigation, they're deemed appropriate. (Full diff here, for interest.)

@conda-forge-admin (Contributor) commented Jan 14, 2025

Hi! This is the friendly automated conda-forge-linting service.

I failed to even lint the recipe, probably because of a conda-smithy bug 😢. This likely indicates a problem in your meta.yaml, though. To get a traceback to help figure out what's going on, install conda-smithy and run conda smithy recipe-lint --conda-forge . from the recipe directory. You can also examine the workflow logs for more detail.

This message was generated by GitHub Actions workflow run https://github.com/conda-forge/conda-forge-webservices/actions/runs/12957694482. Examine the logs at this URL for more detail.
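For reference, a local reproduction of that lint run might look roughly like this (a sketch only; the environment name is arbitrary):

```bash
# Install conda-smithy into a throwaway environment (name is arbitrary)
conda create -n smithy-lint -c conda-forge conda-smithy
conda activate smithy-lint

# Run the linter from the recipe directory to get a full traceback
cd recipe
conda smithy recipe-lint --conda-forge .
```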

@danpetry marked this pull request as draft January 14, 2025 22:48
@jakirkham (Member) left a comment

Thanks Daniel! 🙏

Had a few questions below

+ python_include_dirs
+ torch_include_dirs
+ omp_include_dir_paths
+ + [os.getenv('CONDA_PREFIX') + '/include']
Member:

Glancing through the source code, it looks like inductor accepts an include_dirs flag, which comes from a JSON config file

Could we just add our own JSON config file?

Seems this may be needed in other contexts as well

Contributor Author:

inductor takes the compile_flags.json file for its AOT mode (handled in package.py), but not for its JIT mode. This is the problem the patch is solving - making it look in prefix/include during JIT (torch.compile) compilation. Probably in AOT mode the user wants to specify their own compile flags for their platform, which is what this json file is for.

The code you're looking at is where the AOT code is initializing the base class (BuildOptionsBase). However, we want to initialize the include directories in the child CppTorchOptions class, which is instantiated in cpu_vec_isa.py. It was a while ago I wrote this patch but IIRC that was the path in the stack trace.
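(For context, a quick way to exercise the JIT path the patch touches: a sketch assuming a CPU-only environment, where inductor compiles generated C++ and therefore has to locate headers.)

```bash
# Trigger inductor's JIT (torch.compile) path; on CPU this compiles generated
# C++, so it will fail if the compiler can't find the required headers,
# e.g. the ones under $CONDA_PREFIX/include that the patch adds.
python -c "import torch; f = torch.compile(lambda x: x + 1); print(f(torch.ones(3)))"
```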

Contributor Author:

does this check out ok for you?

mv build/lib.*/torch/include/{ATen,caffe2,tensorpipe,torch,c10} ${PREFIX}/include/
rm ${PREFIX}/lib/libtorch_python.*

# Keep the original backed up to sed later
cp build/CMakeCache.txt build/CMakeCache.txt.orig
;;
pytorch)
-$PREFIX/bin/python -m pip install . --no-deps -vvv --no-clean \
+$PREFIX/bin/python -m pip install . --no-deps --no-build-isolation -vvv --no-clean \
Member:

Should we sync these flags across calls and Windows? Seems they vary a bit

Idk whether it is worth it, but we could also consider using script_env to set most of these flags once and reuse them throughout

Contributor Author:

for windows we've got

%PYTHON% -m pip %PIP_ACTION% . --no-build-isolation --no-deps -vvv --no-clean

and unix

$PREFIX/bin/python -m pip install . --no-deps --no-build-isolation -vvv --no-clean

So the same AFAICS?
Keeping them here seems more in line with what people will expect given other feedstocks?

Contributor Author:

any further comments?

Comment on lines +95 to +96
export USE_SYSTEM_PYBIND11=1
export USE_SYSTEM_EIGEN_INSTALL=1
Member:

What happens to the PyTorch vendored copies of these when using the system copies?

Contributor Author:

It ignores them: pybind11, eigen

Although in the case of eigen, it will fall back to the vendored copy if it doesn't find the system one rather than erroring, which can mean accidental vendoring if you don't check the logs carefully. That isn't great, IMHO.
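(One way to double-check which Eigen was actually picked up after configuring; a rough sketch, since exact variable and message names vary between PyTorch/CMake versions:)

```bash
# Inspect the configure result rather than trusting the default silently;
# a vendored fallback typically shows up as a third_party/eigen include path.
grep -i "eigen" build/CMakeCache.txt
```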

@danpetry (Contributor Author), Jan 17, 2025:

we have a patch to prevent this for mkl, but I thought it best to look at the blas/openmp stuff more closely and maybe address it in a different PR.

Contributor Author:

are you ok with this?

-export MAX_JOBS=${CPU_COUNT}
+# Leave a spare core for other tasks. This may need to be reduced further
+# if we get out of memory errors.
+export MAX_JOBS=$((CPU_COUNT > 1 ? CPU_COUNT - 1 : 1))
Member:

What happens if CPU_COUNT is undefined or empty string?

Member:

What CI provider do things run on for Anaconda? I think this might need to become specific to that provider, and not sit in the overall else: clause here. Our OSX builds on Azure are successful within the 6h limit, but seem to fail with this PR. This is a likely culprit.

Contributor Author:

It's AWS. Happy to just leave it as-is. I think originally we had it capped at four cores and then increased it to CPU_COUNT-1 out of common practice and a desire to speed things up, rather than for concrete technical reasons. We can leave this as-is, and if we have issues on our end we can solve them later.

Member:

I've fixed our case in 6074128, so I have no objection to doing what helps you in the else: branch.

Contributor Author:

have made this a default in the else branch

Member:

Note that CPU_COUNT is a passthrough environment variable

So one can set this in their CI scripts to the intended value and conda-build will use it. This is what conda-forge does

Instead of adding this to the build recipe, would recommend doing this in Anaconda's CI scripts

Contributor Author:

CPU_COUNT isn't being set here. If it's set (somewhere else, like in CI as you suggest) then it'll be used to set MAX_JOBS. If it's not set, this expression will evaluate to 1
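A quick standalone check of that fallback behaviour (not recipe code, just bash arithmetic):

```bash
# Unset: the bare name evaluates to 0 inside $(( )), so the ternary yields 1
unset CPU_COUNT
echo $((CPU_COUNT > 1 ? CPU_COUNT - 1 : 1))    # -> 1

# Empty string: also treated as 0 in arithmetic context
CPU_COUNT=""
echo $((CPU_COUNT > 1 ? CPU_COUNT - 1 : 1))    # -> 1

# Set by CI (e.g. via conda-build's passthrough variable): one core is kept free
CPU_COUNT=8
echo $((CPU_COUNT > 1 ? CPU_COUNT - 1 : 1))    # -> 7
```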

(Further review threads on recipe/build.sh and recipe/meta.yaml: outdated, resolved.)
@danpetry (Contributor Author):

The linter doesn't like this line:

- export MATRIX_GPU_ARCH_VERSION="{{ '.'.join(cuda_compiler_version.split('.')[:2]) }}" # [(cuda_compiler_version != "None") and (linux and x86_64)]

I guess it doesn't like calling .split on cuda_compiler_version. I'm getting TypeError: 'NoneType' object is not callable when rendering and OSError: Feedstock has no recipe/meta.yaml when linting. Deleting this line solves it, but I've tried various jinja stuff and can't solve it while keeping it in. Any ideas from someone who knows what the linter is looking for? If not I can continue...

@jakirkham (Member) commented Jan 15, 2025

Normally something like this will do the trick

{{ ... (cuda_compiler_version or "").split(".") .. }}

Edit: Can also use Jinja's default filter
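For what it's worth, the effect of the `or ""` guard can be sanity-checked in plain Python, since the Jinja expression evaluates the same way (the version string below is just an illustrative value):

```bash
# With the guard, a None compiler version renders to an empty string instead of raising
python -c "v=None; print('.'.join((v or '').split('.')[:2]))"       # -> (empty)
# With a real version string, only the first two components are kept
python -c "v='12.6.3'; print('.'.join((v or '').split('.')[:2]))"   # -> 12.6
```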

@h-vetinari (Member) left a comment

Overall this looks very nice, thanks a lot. I left some questions on the individual patches.

(Review threads on recipe/meta.yaml and recipe/build.sh: outdated, resolved.)
@h-vetinari marked this pull request as ready for review January 28, 2025 21:53
@h-vetinari (Member):

I think this PR is getting very close to merging. We're down to a tiny handful of test failures, which will hopefully be addressed by the latest round of pushes. Beyond that, I don't want to shove more feature work into this PR; tensorpipe support on Windows (etc.) can come in another PR. The CMake fixes were unavoidable though...

Any other thoughts @danpetry?

@danpetry (Contributor Author):

I'm good with that

@h-vetinari (Member):

PS. So glad to see the GPU server back in full force - 4 simultaneous GPU jobs 🤩 - big thank you to @aktech for sorting out some very gnarly hardware issues there! 🙏 🚀

@danpetry (Contributor Author):

May I ask how you found that PR? By looking at these tests in upstream's main and seeing if anything's changed, suggesting a fix..?

@h-vetinari (Member):

> May I ask how you found that PR? By looking at these tests in upstream's main and seeing if anything's changed, suggesting a fix..?

Going through the blame of the test file. It's a bit cumbersome because you need to do it in a local checkout - the file's too big for GH to load the blame. Example (on v2.6.0-rc9, the line numbers on main have shifted again already)

>git blame -L 11030,11050 -- test/inductor/test_torchinductor.py
afabed6ae608 (James Wu       2024-01-21 07:06:31 -0800 11030)         with torch.no_grad():
afabed6ae608 (James Wu       2024-01-21 07:06:31 -0800 11031)             # With keep_output_stride False, inductor would normally have different layout from eager execution
afabed6ae608 (James Wu       2024-01-21 07:06:31 -0800 11032)             # But because our custom op needs fixed layout, the assertions in the custom op will pass
afabed6ae608 (James Wu       2024-01-21 07:06:31 -0800 11033)             self.common(fn, (inp,), check_lowp=False)
afabed6ae608 (James Wu       2024-01-21 07:06:31 -0800 11034)
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11035)     @requires_gpu()
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11036)     @config.patch(implicit_fallbacks=True)
381213ee8a62 (Benjamin Glass 2024-11-26 16:05:19 +0000 11037)     @skip_if_cpp_wrapper(
381213ee8a62 (Benjamin Glass 2024-11-26 16:05:19 +0000 11038)         "Without major redesign, cpp_wrapper will not support custom ops that are "
381213ee8a62 (Benjamin Glass 2024-11-26 16:05:19 +0000 11039)         "defined in Python."
381213ee8a62 (Benjamin Glass 2024-11-26 16:05:19 +0000 11040)     )
05cb98f91d49 (Eddie Yan      2024-10-30 20:34:14 +0000 11041)     @tf32_on_and_off(0.005)
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11042)     def test_mutable_custom_op_fixed_layout2(self):
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11043)         with torch.library._scoped_library("mylib", "DEF") as lib:
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11044)             mod = nn.Conv2d(3, 128, 1, stride=1, bias=False).to(device=GPU_TYPE)
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11045)             inp = torch.rand(2, 3, 128, 128, device=GPU_TYPE)
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11046)             expected_stride = mod(inp).clone().stride()
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11047)
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11048)             lib.define(
afd081c9d4d4 (rzou           2024-08-21 10:08:01 -0700 11049)                 "bar(Tensor x, bool is_compiling) -> Tensor",
f65a564fa247 (rzou           2024-09-09 07:30:54 -0700 11050)                 tags=torch.Tag.flexible_layout,

From there you find the commit pytorch/pytorch@381213e (and its PR). Since that wasn't the one that introduced skip_if_cpp_wrapper though - go to the parent of that commit, find the line number of the test in the file again, and do another git blame around those lines to find pytorch/pytorch@8fa0479 and its PR.
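Roughly, that follow-up step might look like this in the local checkout (commit hash from the blame above; line numbers shift between revisions, so the range is only illustrative):

```bash
# Jump to the parent of the commit that added skip_if_cpp_wrapper
git checkout 381213ee8a62^

# Re-locate the test, since line numbers differ at this revision
grep -n "def test_mutable_custom_op_fixed_layout2" test/inductor/test_torchinductor.py

# Blame around the line number reported by grep to find the earlier change
git blame -L 11030,11050 -- test/inductor/test_torchinductor.py
```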

@danpetry (Contributor Author):

OK, thanks. Additionally, I see that cpp_wrapper should be off by default, which doesn't match up with the suggestion in those changes that these tests are failing because cpp_wrapper is enabled. Or are you just taking these changes because they solve the issue by skipping, and the patch can be removed with v2.6?

@h-vetinari (Member):

> Or, are you just taking these changes because they solve the issue by skipping and the patch can be removed with v2.6?

TBH I was looking for changes in the failing tests that would plausibly fix things, and we get to drop the backported patches in the next version upgrade.

I didn't research much further (those rabbit holes can be an enormous time sink, so I went with 🤞 for now), and it's entirely possible that the fix is not sufficient. In that case I'm happy to skip the tests, but as that added skip was staring me in the face, I thought it's worth a shot. 🤷

@danpetry (Contributor Author):

> those rabbit holes can be an enormous time sink

yep. Thanks for the knowledge/process share.

@h-vetinari (Member) commented Jan 29, 2025

Hm. We're failing the new CMake tests for the CUDA-enabled pytorch.

Caffe2: Found protobuf with new-style protobuf targets.
Caffe2: Protobuf version 28.3.0
CUDA_TOOLKIT_ROOT_DIR not found or specified
Could NOT find CUDA (missing: CUDA_TOOLKIT_ROOT_DIR CUDA_NVCC_EXECUTABLE CUDA_INCLUDE_DIRS CUDA_CUDART_LIBRARY) 
CMake Warning at /home/conda/feedstock_root/build_artifacts/libtorch_1738101139639/_test_env_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placehold_placeho/share/cmake/Caffe2/public/cuda.cmake:31 (message):
  Caffe2: CUDA cannot be found.  Depending on whether you are building Caffe2
  or a Caffe2 dependent library, the next warning / error will give you more
  info.

Let me add a {{ compiler("cuda") }} to the test section.

@h-vetinari (Member):

Yay, more failures

+ python -c 'import torch; assert torch.backends.cudnn.is_available()'
+ python -c 'import torch; assert torch.cuda.is_available()'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AssertionError

Presumably also pytorch needs to find something CUDA-related to pass this check?

The test skip was also not effective, as seen on windows. I'll switch to your suggestion with the unique names.

Finally, we still have a problem with the CMake metadata, because the Caffe2 files still use the long-deprecated find_package(CUDA). I'll comment in #333

@danpetry (Contributor Author) commented Jan 29, 2025

> Presumably also pytorch needs to find something CUDA-related to pass this check?

yes, it uses cudart to query the cuda device count: https://github.com/pytorch/pytorch/blob/40ccb7a86d456bc2fb9e45e4b33c774fbf0b3e46/torch/cuda/__init__.py#L125

Potentially it's worth removing this check, because it's a test of the platform, not of the package itself. Probably just leaving this check should be fine: torch.backends.cuda.is_built()
Having said that, it does let the tests fail early if there's no CUDA support on the platform, which is necessary to run the tests. I'm surprised it's failing, since this is running on a machine with a GPU..?
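For comparison, the two checks side by side (the first only requires a CUDA-enabled build; the second additionally requires a working driver and a visible GPU):

```bash
# Passes whenever the package itself was compiled with CUDA support
python -c "import torch; assert torch.backends.cuda.is_built()"

# Additionally requires a usable driver and at least one visible GPU
python -c "import torch; assert torch.cuda.is_available()"
```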

@danpetry (Contributor Author):

> find_package(CUDA)

do you want some help with this or are you on it? I don't want to duplicate work but happy to help

@h-vetinari (Member):

> do you want some help with this or are you on it? I don't want to duplicate work but happy to help

I'd very much appreciate some help with this. There are just too many CUDA_* variables that are (potentially) affected. Let's take this to #333 though, to avoid blowing up the thread.

@h-vetinari (Member):

Aarch+CUDA timed out after 15h (note time increments per line, as well as some failures)

2025-01-29T21:46:49.3943206Z .......s..............ssssss............................................ [ 94%]
2025-01-29T22:18:49.2460065Z ..........FFFFFFF....................................................... [ 95%]
2025-01-29T22:49:07.7088322Z .............s..............................................s........... [ 96%]
2025-01-29T23:28:30.4877456Z ..........ss.ssss......s........................................sss..... [ 96%]
2025-01-29T23:58:21.5877671Z sss...s..sF....................s........................................ [ 97%]
2025-01-30T00:41:33.9959881Z ................................sss................s.....s...s......s... [ 98%]
2025-01-30T00:55:01.2081636Z .....ssssss.......s.s..............ss....................s.............s [ 99%]
2025-01-30T01:15:21.6288586Z ................sss......................s.............................. [ 99%]
2025-01-30T01:17:10.5264273Z ##[error]The operation was canceled.
2025-01-30T01:17:10.5555416Z Post job cleanup.

Still, I'm going to merge this for now to get some new builds with the CMake fixes out. I've discussed with @danpetry that it would be good to have some recent builds downloadable for debugging the CUDA part of #333, and so this will help in that respect as well. For that, I'm skipping the CUDA failures mentioned above, but I'll have yet another PR to try fixing this more properly (+ a bunch of other accumulated stuff).

@h-vetinari (Member) left a comment

Thanks for the patience here!

@h-vetinari merged commit 8338fd7 into conda-forge:main Jan 30, 2025
21 of 27 checks passed
@h-vetinari (Member):

> yes, it uses cudart to query the cuda device count: https://github.com/pytorch/pytorch/blob/40ccb7a86d456bc2fb9e45e4b33c774fbf0b3e46/torch/cuda/__init__.py#L125
>
> potentially, it's worth removing this check, because it's a test of the platform not of the package itself. Probably just leaving this check should be fine: torch.backends.cuda.is_built() Having said that, it does let the tests fail early if there's no cuda support on the platform, which is necessary to run the tests. I'm surprised it's failing since this is running on a machine with a GPU..?

Well, unfortunately this fails even with the full CUDA toolchain (incl. cudart) present. I don't know how these tests are passing for Anaconda, or what's different here, but it's not working.

@h-vetinari (Member):

WTH, why does the test not find the GPUs anymore? (this is on MKL; meaning it passed torch.cuda.is_available() to get to the python tests)

=========================== short test summary info ============================
FAILED [4.2186s] test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_2d_out_of_bounds_class_index_cuda_float16 - AssertionError: 'CUDA error: device-side assert triggered' not found in 'PYTORCH_API_USAGE torch.python.import\nPYTORCH_API_USAGE c10d.python.import\nPYTORCH_API_USAGE aten.init.cuda\nE\n======================================================================\nERROR: test_cross_entropy_loss_2d_out_of_bounds_class_index (__main__.TestThatContainsCUDAAssert)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n  File "<string>", line 16, in test_cross_entropy_loss_2d_out_of_bounds_class_index\n  File "$PREFIX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init\n    torch._C._cuda_init()\nRuntimeError: No CUDA GPUs are available\n\n----------------------------------------------------------------------\nRan 1 test in 0.006s\n\nFAILED (errors=1)\n'

To execute this test, run the following from the base repo dir:
    python test/test_nn.py TestNNDeviceTypeCUDA.test_cross_entropy_loss_2d_out_of_bounds_class_index_cuda_float16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [3.2179s] test/test_nn.py::TestNNDeviceTypeCUDA::test_cross_entropy_loss_2d_out_of_bounds_class_index_cuda_float32 - AssertionError: 'CUDA error: device-side assert triggered' not found in 'PYTORCH_API_USAGE torch.python.import\nPYTORCH_API_USAGE c10d.python.import\nPYTORCH_API_USAGE aten.init.cuda\nE\n======================================================================\nERROR: test_cross_entropy_loss_2d_out_of_bounds_class_index (__main__.TestThatContainsCUDAAssert)\n----------------------------------------------------------------------\nTraceback (most recent call last):\n  File "<string>", line 16, in test_cross_entropy_loss_2d_out_of_bounds_class_index\n  File "$PREFIX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init\n    torch._C._cuda_init()\nRuntimeError: No CUDA GPUs are available\n\n----------------------------------------------------------------------\nRan 1 test in 0.004s\n\nFAILED (errors=1)\n'

To execute this test, run the following from the base repo dir:
    python test/test_nn.py TestNNDeviceTypeCUDA.test_cross_entropy_loss_2d_out_of_bounds_class_index_cuda_float32

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [1.8866s] test/test_torch.py::TestTorchDeviceTypeCUDA::test_cublas_config_nondeterministic_alert_cuda - AssertionError: Subprocess exception while attempting to run function "mm" with config "garbage":
Traceback (most recent call last):
  File "<string>", line 10, in <module>
  File "$PREFIX/lib/python3.10/site-packages/torch/cuda/__init__.py", line 319, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available


To execute this test, run the following from the base repo dir:
    python test/test_torch.py TestTorchDeviceTypeCUDA.test_cublas_config_nondeterministic_alert_cuda

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
FAILED [5.6304s] test/inductor/test_torchinductor.py::TritonCodeGenTests::test_indirect_device_assert - AssertionError: False is not true : first_arg, 2, True, False

To execute this test, run the following from the base repo dir:
    python test/inductor/test_torchinductor.py TritonCodeGenTests.test_indirect_device_assert

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
= 4 failed, 14519 passed, 2657 skipped, 91 xfailed, 143276 warnings in 4387.90s (1:13:07) =

Anyway, I went back to the state before this PR in 4f1b584.

@danpetry deleted the anaconda-sync branch January 30, 2025 17:33
@h-vetinari (Member) commented Jan 30, 2025

Sigh, we really went 4/4 with the CUDA builds failing for various reasons; it looks like the test_torchinductor tests cause a massive runtime regression in emulation of linux+aarch, where builds went from ~11h to >15h. I'm going to skip them there. Finally, it seems the test_torchinductor tests yield (at least on windows)

FAILED [0.0306s] test/inductor/test_torchinductor.py::SweepInputsCpuTest::test_cpu_broadcast2_int - RuntimeError: Python 3.13+ not yet supported for torch.compile
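One possible way to skip the whole test_torchinductor module on the affected platform (a sketch only; the feedstock's actual test script may use a different mechanism, and the target_platform check here is an assumption):

```bash
# Skip the very slow / failing inductor tests when building for emulated aarch64.
EXTRA_PYTEST_ARGS=""
if [[ "${target_platform:-}" == "linux-aarch64" ]]; then
    EXTRA_PYTEST_ARGS="--ignore=test/inductor/test_torchinductor.py"
fi
python -m pytest test/ ${EXTRA_PYTEST_ARGS}
```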
