
Mismatch between dataset path in env of docker run and the benchmark run command #91

Closed
anandhu-eng opened this issue Dec 30, 2024 · 3 comments
Assignees: arjunsuresh
Labels: bug (Something isn't working)

Comments

anandhu-eng (Contributor) commented Dec 30, 2024

Run command:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r5.0-dev    --model=rgat    --implementation=reference    --framework=pytorch    --category=edge    --scenario=Offline    --execution_mode=test    --device=cuda     --quiet    --test_query_count=10000 --env.CM_DATASET_IGBH_PATH=/data/common/anandhu/igbh/ --rerun  --docker --docker_cm_repo_branch=dev --threads=2 --env.CM_ACTIVATE_RGAT_IN_MEMORY=yes --batch_size=256

Output:

CM script::benchmark-program/run.sh

Run Directory: /home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT

CMD: /home/cmuser/venv/cm/bin/python3 main.py  --scenario Offline --dataset-path /data/common/anandhu/igbh/ --device gpu   --max-batchsize 256 --threads 2 --user_conf '/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/c883385dc3b94168a9afe0397c9f31e3.conf' --dataset igbh-dgl --profile rgat-dgl-full  --output /cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1 --dtype fp32 --model-path /home/cmuser/CM/repos/local/cache/4dd8308dcab944a5/RGAT/RGAT.pt --in-memory  2>&1 | tee '/cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1/console.out'; echo \${PIPESTATUS[0]} > exitstatus

DEBUG:root:    - Running native script "/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/benchmark-program/run-ubuntu.sh" from temporal script "tmp-run.sh" in "/home/cmuser" ...
INFO:root:         ! cd /home/cmuser
INFO:root:         ! call /home/cmuser/CM/repos/mlcommons@mlperf-automations/script/benchmark-program/run-ubuntu.sh from tmp-run.sh

/home/cmuser/venv/cm/bin/python3 main.py  --scenario Offline --dataset-path /data/common/anandhu/igbh/ --device gpu   --max-batchsize 256 --threads 2 --user_conf '/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/c883385dc3b94168a9afe0397c9f31e3.conf' --dataset igbh-dgl --profile rgat-dgl-full  --output /cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1 --dtype fp32 --model-path /home/cmuser/CM/repos/local/cache/4dd8308dcab944a5/RGAT/RGAT.pt --in-memory  2>&1 | tee '/cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1/console.out'; echo ${PIPESTATUS[0]} > exitstatus
INFO:main:Namespace(dataset='igbh-dgl', dataset_path='/data/common/anandhu/igbh/', in_memory=True, layout='COO', profile='rgat-dgl-full', scenario='Offline', max_batchsize=256, threads=2, accuracy=False, find_peak_performance=False, backend='dgl', model_name='rgat', output='/cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1', qps=None, model_path='/home/cmuser/CM/repos/local/cache/4dd8308dcab944a5/RGAT/RGAT.pt', dtype='fp32', device='gpu', user_conf='/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/c883385dc3b94168a9afe0397c9f31e3.conf', audit_conf='audit.config', time=None, count=None, debug=False, performance_sample_count=5000, max_latency=None, samples_per_query=8)
Traceback (most recent call last):
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/main.py", line 510, in <module>
    main()
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/main.py", line 363, in main
    ds = dataset_class(
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 131, in __init__
    self.igbh_dataset = IGBHeteroGraphStructure(
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 203, in __init__
    self.edge_dict = self.load_edge_dict()
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 237, in load_edge_dict
    loaded_edges = {
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 237, in <dictcomp>
    loaded_edges = {
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 232, in load_edge
    np.load(osp.join(parent_path, edge, "edge_index.npy"), mmap_mode=mmap))
  File "/home/cmuser/venv/cm/lib/python3.10/site-packages/numpy/lib/npyio.py", line 427, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/data/common/anandhu/igbh/full/processed/paper__cites__paper/edge_index.npy'
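The failure above could be caught before launch with a quick pre-flight check of the expected IGBH layout (the full/processed/<edge>/edge_index.npy files the loader reads). This is a hypothetical helper for illustration, not part of the reference code; only paper__cites__paper appears in the traceback, so the edge list here is deliberately minimal:

```python
import os.path as osp

# Edge directory named in the traceback; the real dataset has more edge types.
EDGES = ["paper__cites__paper"]

def check_igbh_layout(dataset_path, size="full"):
    """Return the list of missing edge_index.npy files; empty means all present."""
    missing = []
    for edge in EDGES:
        p = osp.join(dataset_path, size, "processed", edge, "edge_index.npy")
        if not osp.exists(p):
            missing.append(p)
    return missing
```

Running this against the un-rewritten host path inside the container would report the same missing file as the traceback, instead of failing deep inside a thread pool.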

Ideally, the dataset path in the benchmark command should be rewritten from /data/common/anandhu/igbh/ to /cm-mount/data/common/anandhu/igbh.

Note: in docker run command, it is correctly being mounted: -v "/data/common/anandhu/igbh":/cm-mount/data/common/anandhu/igbh
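The rewrite the automation is expected to apply can be sketched as a mount-aware path translation: given the -v mappings passed to docker run, any host path under a mounted directory maps to the corresponding container path. This is a minimal illustration with hypothetical helper names, assuming the mount table is available as a dict:

```python
def to_container_path(host_path, mounts):
    """Map a host path to its container path using docker -v style mounts.

    mounts: dict of {host_dir: container_dir}, mirroring the -v flags.
    Paths not under any mount are returned unchanged.
    """
    host_path = host_path.rstrip("/")
    for host_dir, container_dir in mounts.items():
        if host_path == host_dir or host_path.startswith(host_dir + "/"):
            return container_dir + host_path[len(host_dir):]
    return host_path

# The mount from the docker run command in this issue:
mounts = {"/data/common/anandhu/igbh": "/cm-mount/data/common/anandhu/igbh"}
```

With this mapping, /data/common/anandhu/igbh/ translates to /cm-mount/data/common/anandhu/igbh, which is the path the benchmark command should have received.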

@anandhu-eng anandhu-eng changed the title env variable(outside cm cache) not being modified when mount to docker Mismatch between dataset path in env of docker run and the benchmark run command Dec 30, 2024
anandhu-eng (Contributor, Author)

Upon further examination, I found that the env variable is properly modified (visible in the docker run command) to --env.CM_DATASET_IGBH_PATH=/cm-mount/data/common/anandhu/igbh, but the path in the benchmark run command is different:

/home/cmuser/venv/cm/bin/python3 main.py  --scenario Offline --dataset-path /data/common/anandhu/igbh/ --device gpu   --max-batchsize 256 --threads 2 --user_conf '/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/c883385dc3b94168a9afe0397c9f31e3.conf' --dataset igbh-dgl --profile rgat-dgl-full  --output /cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1 --dtype fp32 --model-path /home/cmuser/CM/repos/local/cache/4dd8308dcab944a5/RGAT/RGAT.pt --in-memory  2>&1 | tee '/cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1/console.out'; echo ${PIPESTATUS[0]} > exitstatus

anandhu-eng (Contributor, Author)

The CM_DATASET_IGBH_PATH env variable is assigned twice inside the docker run command:

docker run -it --entrypoint '' --group-add $(id -g $USER) --gpus=all --shm-size=32gb --dns 8.8.8.8 --dns 8.8.4.4 --cap-add SYS_ADMIN --cap-add SYS_TIME --security-opt apparmor=unconfined --security-opt seccomp=unconfined -v /home/anandhu/runstest:/cm-mount/home/anandhu/runstest -v /home/anandhu/CM/repos/local/cache/b6bd72defb5c4bc3:/home/cmuser/CM/repos/local/cache/b6bd72defb5c4bc3 -v "/data/common/anandhu/igbh":/cm-mount/data/common/anandhu/igbh local/cm-script-app-mlperf-inference-generic--reference--rgat--pytorch--cuda--test--r5.0-dev-default--offline:nvcr.io-nvidia-pytorch-24.03-py3-latest bash -c '(cm pull repo && cm run script --tags=app,mlperf,inference,generic,_reference,_rgat,_pytorch,_cuda,_test,_r5.0-dev_default,_offline --quiet=true --env.CM_DATASET_IGBH_PATH=/data/common/anandhu/igbh/ --env.CM_ACTIVATE_RGAT_IN_MEMORY=yes --env.CM_QUIET=yes --env.CM_MLPERF_IMPLEMENTATION=reference --env.CM_MLPERF_MODEL=rgat --env.CM_MLPERF_RUN_STYLE=test --env.CM_MLPERF_SKIP_SUBMISSION_GENERATION=False --env.CM_DOCKER_PRIVILEGED_MODE=True --env.CM_MLPERF_LOADGEN_MAX_BATCHSIZE=256 --env.CM_MLPERF_SUBMISSION_SYSTEM_TYPE=edge --env.CM_MLPERF_DEVICE=cuda --env.CM_MLPERF_USE_DOCKER=True --env.CM_MLPERF_BACKEND=pytorch --env.CM_RERUN=True --env.CM_MLPERF_LOADGEN_SCENARIO=Offline --env.CM_TEST_QUERY_COUNT=10000 --env.CM_NUM_THREADS=2 --env.CM_MLPERF_FIND_PERFORMANCE_MODE=yes --env.CM_MLPERF_LOADGEN_ALL_MODES=no --env.CM_MLPERF_LOADGEN_MODE=performance --env.CM_MLPERF_RESULT_PUSH_TO_GITHUB=False --env.CM_MLPERF_SUBMISSION_GENERATION_STYLE=full --env.CM_MLPERF_INFERENCE_VERSION=4.1-dev --env.CM_RUN_MLPERF_INFERENCE_APP_DEFAULTS=r5.0-dev_default --env.CM_MLPERF_SUBMISSION_CHECKER_VERSION=v5.0 --env.CM_MLPERF_INFERENCE_SOURCE_VERSION=5.0.2 --env.CM_MLPERF_LAST_RELEASE=v4.1 --env.CM_TMP_CURRENT_PATH=/home/anandhu/runstest --env.CM_TMP_PIP_VERSION_STRING= --env.CM_MODEL=rgat --env.CM_MLPERF_LOADGEN_COMPLIANCE=no --env.CM_MLPERF_LOADGEN_EXTRA_OPTIONS= 
--env.CM_MLPERF_LOADGEN_SCENARIOS,=Offline --env.CM_MLPERF_LOADGEN_MODES,=performance --env.OUTPUT_BASE_DIR=/home/anandhu/runstest --env.CM_OUTPUT_FOLDER_NAME=test_results --add_deps_recursive.coco2014-original.tags=_full --add_deps_recursive.coco2014-preprocessed.tags=_full --add_deps_recursive.imagenet-original.tags=_full --add_deps_recursive.imagenet-preprocessed.tags=_full --add_deps_recursive.openimages-original.tags=_full --add_deps_recursive.openimages-preprocessed.tags=_full --add_deps_recursive.openorca-original.tags=_full --add_deps_recursive.openorca-preprocessed.tags=_full --add_deps_recursive.coco2014-dataset.tags=_full --add_deps_recursive.igbh-dataset.tags=_full --add_deps_recursive.get-mlperf-inference-results-dir.tags=_version.r5.0-dev --add_deps_recursive.get-mlperf-inference-submission-dir.tags=_version.r5.0-dev --add_deps_recursive.mlperf-inference-nvidia-scratch-space.tags=_version.r5.0-dev --add_deps_recursive.mlperf-inference-implementation.tags=_batch_size.256 --v=False --print_env=False --print_deps=False --dump_version_info=True --env.OUTPUT_BASE_DIR=/cm-mount/home/anandhu/runstest --env.CM_MLPERF_INFERENCE_SUBMISSION_DIR=/home/cmuser/CM/repos/local/cache/b6bd72defb5c4bc3/mlperf-inference-submission --env.CM_DATASET_IGBH_PATH=/cm-mount/data/common/anandhu/igbh && bash ) || bash'
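Because the flag appears twice, which value wins depends on how the automation parses repeated --env. options: a left-to-right parse that overwrites on each occurrence keeps the corrected second value (/cm-mount/...), while a parse that stops at the first match keeps the stale host path, which would explain the mismatch. A minimal illustration of last-assignment-wins semantics (a hypothetical parser, not the actual CM implementation):

```python
def parse_env_flags(argv):
    """Collect --env.KEY=VALUE flags; later occurrences overwrite earlier ones."""
    env = {}
    for arg in argv:
        if arg.startswith("--env.") and "=" in arg:
            key, _, value = arg[len("--env."):].partition("=")
            env[key] = value
    return env

# The two conflicting assignments from the docker run command above:
argv = [
    "--env.CM_DATASET_IGBH_PATH=/data/common/anandhu/igbh/",
    "--env.CM_DATASET_IGBH_PATH=/cm-mount/data/common/anandhu/igbh",
]
```

Under last-wins parsing the corrected mount path survives; the observed behavior suggests the first (host) value was the one propagated to the benchmark command.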

arjunsuresh added a commit to GATEOverflow/mlperf-automations that referenced this issue Dec 30, 2024
@arjunsuresh arjunsuresh self-assigned this Dec 30, 2024
@arjunsuresh arjunsuresh added the bug Something isn't working label Dec 30, 2024
arjunsuresh (Collaborator)

@anandhu-eng It should be resolved in this PR

arjunsuresh added a commit to GATEOverflow/mlperf-automations that referenced this issue Dec 30, 2024
Status: Done
2 participants