
`ninja -v` command fails, leaving the transformer_inference.so file missing #12

Open
Debouter opened this issue Aug 10, 2023 · 4 comments


@Debouter

Hi~
I hit the following error when running demo.py:

Traceback (most recent call last):
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
    ......
ImportError: /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory

My initial guess is that the `ninja -v` command is failing, so the shared object file transformer_inference.so is never generated.
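For context, a minimal sketch (not the reporter's code) of why the failure surfaces this way: to my understanding, torch.utils.cpp_extension runs `ninja -v` through `subprocess.run(..., check=True)`, so any non-zero exit from the build becomes a `CalledProcessError`, and the real compiler error is in ninja's own output above it:

```python
import subprocess
import sys

# Minimal sketch of how a failing build command surfaces as CalledProcessError.
# Here a child process that exits with status 1 stands in for `ninja -v`.
try:
    subprocess.run([sys.executable, "-c", "raise SystemExit(1)"], check=True)
except subprocess.CalledProcessError as e:
    print(e.returncode)  # 1
```

So the `CalledProcessError` itself only says that the build failed; the actual cause has to be read from the compiler messages ninja prints before it.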

I have already tried the various fixes suggested online for `Command '['ninja', '-v']' returned non-zero exit status 1`, such as installing or disabling the ninja library and downgrading PyTorch, but none of them solved the problem.

My environment is as follows:

  • python==3.10.12
  • the torch/cuda/deepspeed versions all match your environment

Have you run into this problem before? If not, could you share your transformer_inference.so file? It should be located at <user_path>/.cache/torch_extensions/pyXX_cuXX/transformer_inference.

Thanks!

@CoinCheung
Owner

Hi,

Would you post your full error message? I do not have this problem.

@Debouter
Author

Here is the whole stack trace. By the way, could you tell me which versions of GCC and Ninja you use?

Using /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Traceback (most recent call last):
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1893, in _run_ninja_build
    subprocess.run(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/petrelfs/klk/gdGPT/demo.py", line 64, in <module>
    res = infer_with_deepspeed(model_name, prompt)
  File "/mnt/petrelfs/klk/gdGPT/demo.py", line 40, in infer_with_deepspeed
    model.model = deepspeed.init_inference(model.model, config=infer_config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 192, in __init__
    self._apply_injection_policy(config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 426, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 523, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 766, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 823, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 500, in replace_fn
    new_module = replace_with_policy(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 348, in replace_with_policy
    _container.create_module()
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/containers/bloom.py", line 30, in create_module
    self.module = DeepSpeedBloomInference(_config, mp_group=self.mp_group)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_bloom.py", line 20, in __init__
    super().__init__(config, mp_group, quantize_scales, quantize_groups, merge_count, mlp_extra_grouping)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 58, in __init__
    inference_module = builder.load()
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
    return self.jit_load(verbose)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
    op_module = load(name=self.name,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1509, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1624, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1909, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'transformer_inference'

Loading extension module transformer_inference...
Traceback (most recent call last):
  File "/mnt/petrelfs/klk/gdGPT/demo.py", line 64, in <module>
    res = infer_with_deepspeed(model_name, prompt)
  File "/mnt/petrelfs/klk/gdGPT/demo.py", line 40, in infer_with_deepspeed
    model.model = deepspeed.init_inference(model.model, config=infer_config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/__init__.py", line 342, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 192, in __init__
    self._apply_injection_policy(config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 426, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 523, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 766, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 847, in _replace_module
    _, layer_id = _replace_module(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 823, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 500, in replace_fn
    new_module = replace_with_policy(child,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 348, in replace_with_policy
    _container.create_module()
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/module_inject/containers/bloom.py", line 30, in create_module
    self.module = DeepSpeedBloomInference(_config, mp_group=self.mp_group)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_bloom.py", line 20, in __init__
    super().__init__(config, mp_group, quantize_scales, quantize_groups, merge_count, mlp_extra_grouping)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/model_implementations/transformers/ds_transformer.py", line 58, in __init__
    inference_module = builder.load()
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 454, in load
    return self.jit_load(verbose)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 497, in jit_load
    op_module = load(name=self.name,
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1535, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/mnt/petrelfs/klk/anaconda3/envs/ds/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1929, in _import_module_from_library
    module = importlib.util.module_from_spec(spec)
  File "<frozen importlib._bootstrap>", line 571, in module_from_spec
  File "<frozen importlib._bootstrap_external>", line 1176, in create_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
ImportError: /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118/transformer_inference/transformer_inference.so: cannot open shared object file: No such file or directory

@CoinCheung
Owner

Hi,

the output of running ninja --version on my machine is :

1.11.1.git.kitware.jobserver-1

and the output of running gcc -v on my machine is:

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/7/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 7.5.0-3ubuntu1~18.04' --with-bugurl=file:///usr/share/doc/gcc-7/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++ --prefix=/usr --with-gcc-major-version-only --program-suffix=-7 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-libmpx --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04) 

Would you rm -rf /mnt/petrelfs/klk/.cache/torch_extensions/py310_cu118 and try again?
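A minimal sketch of that cleanup step in Python (the `clear_extension_cache` helper is hypothetical, and the real target directory would be the `~/.cache/torch_extensions/py310_cu118` path from the traceback):

```python
import pathlib
import shutil
import tempfile

# Hypothetical helper: delete a stale torch-extensions build directory so the
# next run re-emits build.ninja and recompiles the extension from scratch.
def clear_extension_cache(root: str) -> bool:
    path = pathlib.Path(root)
    existed = path.exists()
    shutil.rmtree(path, ignore_errors=True)
    return existed

# Demo against a throwaway directory standing in for the real cache dir:
demo_dir = pathlib.Path(tempfile.mkdtemp())
print(clear_extension_cache(str(demo_dir)))  # True: directory existed and was removed
print(clear_extension_cache(str(demo_dir)))  # False: already gone
```

Either way (`rm -rf` or a script like this), the point is that PyTorch only rebuilds the extension when the cached build directory is absent or stale.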

@Debouter
Author

Well, I have fixed it by adjusting my gcc version to match yours, removing the directory you mentioned above, and setting export TORCH_EXTENSIONS_DIR=/tmp as suggested in microsoft/DeepSpeed#3356.
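For anyone hitting the same thing, the environment-variable part of that fix can also be set from inside the script, as long as it runs before DeepSpeed triggers the JIT build (a sketch; `/tmp/torch_extensions` is an example path):

```python
import os

# Redirect PyTorch's JIT C++ extension builds away from the default
# ~/.cache/torch_extensions, which can misbehave on networked home
# directories (the workaround discussed in microsoft/DeepSpeed#3356).
# This must be set before deepspeed.init_inference() compiles the op.
os.environ["TORCH_EXTENSIONS_DIR"] = "/tmp/torch_extensions"

print(os.environ["TORCH_EXTENSIONS_DIR"])  # /tmp/torch_extensions
```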

Similar problems still occasionally occur when installing other packages, but everything works fine in this repo. Anyway, thanks a lot!
