
Mismatch between dataset path in env of docker run and the benchmark run command #91

Closed
anandhu-eng opened this issue Dec 30, 2024 · 3 comments
Assignees: arjunsuresh
Labels: bug (Something isn't working)

Comments

anandhu-eng (Contributor) commented Dec 30, 2024

Run command:

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r5.0-dev    --model=rgat    --implementation=reference    --framework=pytorch    --category=edge    --scenario=Offline    --execution_mode=test    --device=cuda     --quiet    --test_query_count=10000 --env.CM_DATASET_IGBH_PATH=/data/common/anandhu/igbh/ --rerun  --docker --docker_cm_repo_branch=dev --threads=2 --env.CM_ACTIVATE_RGAT_IN_MEMORY=yes --batch_size=256

Output:

CM script::benchmark-program/run.sh

Run Directory: /home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT

CMD: /home/cmuser/venv/cm/bin/python3 main.py  --scenario Offline --dataset-path /data/common/anandhu/igbh/ --device gpu   --max-batchsize 256 --threads 2 --user_conf '/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/c883385dc3b94168a9afe0397c9f31e3.conf' --dataset igbh-dgl --profile rgat-dgl-full  --output /cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1 --dtype fp32 --model-path /home/cmuser/CM/repos/local/cache/4dd8308dcab944a5/RGAT/RGAT.pt --in-memory  2>&1 | tee '/cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1/console.out'; echo \${PIPESTATUS[0]} > exitstatus

DEBUG:root:    - Running native script "/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/benchmark-program/run-ubuntu.sh" from temporal script "tmp-run.sh" in "/home/cmuser" ...
INFO:root:         ! cd /home/cmuser
INFO:root:         ! call /home/cmuser/CM/repos/mlcommons@mlperf-automations/script/benchmark-program/run-ubuntu.sh from tmp-run.sh

/home/cmuser/venv/cm/bin/python3 main.py  --scenario Offline --dataset-path /data/common/anandhu/igbh/ --device gpu   --max-batchsize 256 --threads 2 --user_conf '/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/c883385dc3b94168a9afe0397c9f31e3.conf' --dataset igbh-dgl --profile rgat-dgl-full  --output /cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1 --dtype fp32 --model-path /home/cmuser/CM/repos/local/cache/4dd8308dcab944a5/RGAT/RGAT.pt --in-memory  2>&1 | tee '/cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1/console.out'; echo ${PIPESTATUS[0]} > exitstatus
INFO:main:Namespace(dataset='igbh-dgl', dataset_path='/data/common/anandhu/igbh/', in_memory=True, layout='COO', profile='rgat-dgl-full', scenario='Offline', max_batchsize=256, threads=2, accuracy=False, find_peak_performance=False, backend='dgl', model_name='rgat', output='/cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1', qps=None, model_path='/home/cmuser/CM/repos/local/cache/4dd8308dcab944a5/RGAT/RGAT.pt', dtype='fp32', device='gpu', user_conf='/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/c883385dc3b94168a9afe0397c9f31e3.conf', audit_conf='audit.config', time=None, count=None, debug=False, performance_sample_count=5000, max_latency=None, samples_per_query=8)
Traceback (most recent call last):
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/main.py", line 510, in <module>
    main()
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/main.py", line 363, in main
    ds = dataset_class(
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 131, in __init__
    self.igbh_dataset = IGBHeteroGraphStructure(
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 203, in __init__
    self.edge_dict = self.load_edge_dict()
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 237, in load_edge_dict
    loaded_edges = {
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 237, in <dictcomp>
    loaded_edges = {
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 621, in result_iterator
    yield _result_or_cancel(fs.pop())
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 319, in _result_or_cancel
    return fut.result(timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/cmuser/CM/repos/local/cache/e209a71e674046f8/inference/graph/R-GAT/dgl_utilities/feature_fetching.py", line 232, in load_edge
    np.load(osp.join(parent_path, edge, "edge_index.npy"), mmap_mode=mmap))
  File "/home/cmuser/venv/cm/lib/python3.10/site-packages/numpy/lib/npyio.py", line 427, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/data/common/anandhu/igbh/full/processed/paper__cites__paper/edge_index.npy'
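The failure above could be caught before launch with a quick pre-flight check of the expected IGBH layout (the full/processed/<edge>/edge_index.npy files the loader reads). This is a hypothetical helper for illustration, not part of the reference code; only paper__cites__paper appears in the traceback, so the edge list here is deliberately minimal:

```python
import os.path as osp

# Edge directory named in the traceback; the real dataset has more edge types.
EDGES = ["paper__cites__paper"]

def check_igbh_layout(dataset_path, size="full"):
    """Return the list of missing edge_index.npy files; empty means all present."""
    missing = []
    for edge in EDGES:
        p = osp.join(dataset_path, size, "processed", edge, "edge_index.npy")
        if not osp.exists(p):
            missing.append(p)
    return missing
```

Running this against the un-rewritten host path inside the container would report the same missing file as the traceback, instead of failing deep inside a thread pool.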

Ideally, the dataset path in the benchmark command should be rewritten from /data/common/anandhu/igbh/ to /cm-mount/data/common/anandhu/igbh.

Note: in docker run command, it is correctly being mounted: -v "/data/common/anandhu/igbh":/cm-mount/data/common/anandhu/igbh
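The rewrite the automation is expected to apply can be sketched as a mount-aware path translation: given the -v mappings passed to docker run, any host path under a mounted directory maps to the corresponding container path. This is a minimal illustration with hypothetical helper names, assuming the mount table is available as a dict:

```python
def to_container_path(host_path, mounts):
    """Map a host path to its container path using docker -v style mounts.

    mounts: dict of {host_dir: container_dir}, mirroring the -v flags.
    Paths not under any mount are returned unchanged.
    """
    host_path = host_path.rstrip("/")
    for host_dir, container_dir in mounts.items():
        if host_path == host_dir or host_path.startswith(host_dir + "/"):
            return container_dir + host_path[len(host_dir):]
    return host_path

# The mount from the docker run command in this issue:
mounts = {"/data/common/anandhu/igbh": "/cm-mount/data/common/anandhu/igbh"}
```

With this mapping, /data/common/anandhu/igbh/ translates to /cm-mount/data/common/anandhu/igbh, which is the path the benchmark command should have received.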

@anandhu-eng anandhu-eng changed the title env variable(outside cm cache) not being modified when mount to docker Mismatch between dataset path in env of docker run and the benchmark run command Dec 30, 2024
anandhu-eng (Contributor, Author)

Upon further examination, I found that the env variable is properly modified (visible in the docker run command) to --env.CM_DATASET_IGBH_PATH=/cm-mount/data/common/anandhu/igbh, but the path in the benchmark run command is different:

/home/cmuser/venv/cm/bin/python3 main.py  --scenario Offline --dataset-path /data/common/anandhu/igbh/ --device gpu   --max-batchsize 256 --threads 2 --user_conf '/home/cmuser/CM/repos/mlcommons@mlperf-automations/script/generate-mlperf-inference-user-conf/tmp/c883385dc3b94168a9afe0397c9f31e3.conf' --dataset igbh-dgl --profile rgat-dgl-full  --output /cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1 --dtype fp32 --model-path /home/cmuser/CM/repos/local/cache/4dd8308dcab944a5/RGAT/RGAT.pt --in-memory  2>&1 | tee '/cm-mount/home/anandhu/runstest/test_results/ef61842c957b-reference-gpu-pytorch-v2.4.0-cu124/rgat/offline/performance/run_1/console.out'; echo ${PIPESTATUS[0]} > exitstatus

anandhu-eng (Contributor, Author)

The CM_DATASET_IGBH_PATH env variable is assigned twice inside the docker run command:

docker run -it --entrypoint '' --group-add $(id -g $USER) --gpus=all --shm-size=32gb --dns 8.8.8.8 --dns 8.8.4.4 --cap-add SYS_ADMIN --cap-add SYS_TIME --security-opt apparmor=unconfined --security-opt seccomp=unconfined -v /home/anandhu/runstest:/cm-mount/home/anandhu/runstest -v /home/anandhu/CM/repos/local/cache/b6bd72defb5c4bc3:/home/cmuser/CM/repos/local/cache/b6bd72defb5c4bc3 -v "/data/common/anandhu/igbh":/cm-mount/data/common/anandhu/igbh local/cm-script-app-mlperf-inference-generic--reference--rgat--pytorch--cuda--test--r5.0-dev-default--offline:nvcr.io-nvidia-pytorch-24.03-py3-latest bash -c '(cm pull repo && cm run script --tags=app,mlperf,inference,generic,_reference,_rgat,_pytorch,_cuda,_test,_r5.0-dev_default,_offline --quiet=true --env.CM_DATASET_IGBH_PATH=/data/common/anandhu/igbh/ --env.CM_ACTIVATE_RGAT_IN_MEMORY=yes --env.CM_QUIET=yes --env.CM_MLPERF_IMPLEMENTATION=reference --env.CM_MLPERF_MODEL=rgat --env.CM_MLPERF_RUN_STYLE=test --env.CM_MLPERF_SKIP_SUBMISSION_GENERATION=False --env.CM_DOCKER_PRIVILEGED_MODE=True --env.CM_MLPERF_LOADGEN_MAX_BATCHSIZE=256 --env.CM_MLPERF_SUBMISSION_SYSTEM_TYPE=edge --env.CM_MLPERF_DEVICE=cuda --env.CM_MLPERF_USE_DOCKER=True --env.CM_MLPERF_BACKEND=pytorch --env.CM_RERUN=True --env.CM_MLPERF_LOADGEN_SCENARIO=Offline --env.CM_TEST_QUERY_COUNT=10000 --env.CM_NUM_THREADS=2 --env.CM_MLPERF_FIND_PERFORMANCE_MODE=yes --env.CM_MLPERF_LOADGEN_ALL_MODES=no --env.CM_MLPERF_LOADGEN_MODE=performance --env.CM_MLPERF_RESULT_PUSH_TO_GITHUB=False --env.CM_MLPERF_SUBMISSION_GENERATION_STYLE=full --env.CM_MLPERF_INFERENCE_VERSION=4.1-dev --env.CM_RUN_MLPERF_INFERENCE_APP_DEFAULTS=r5.0-dev_default --env.CM_MLPERF_SUBMISSION_CHECKER_VERSION=v5.0 --env.CM_MLPERF_INFERENCE_SOURCE_VERSION=5.0.2 --env.CM_MLPERF_LAST_RELEASE=v4.1 --env.CM_TMP_CURRENT_PATH=/home/anandhu/runstest --env.CM_TMP_PIP_VERSION_STRING= --env.CM_MODEL=rgat --env.CM_MLPERF_LOADGEN_COMPLIANCE=no --env.CM_MLPERF_LOADGEN_EXTRA_OPTIONS= 
--env.CM_MLPERF_LOADGEN_SCENARIOS,=Offline --env.CM_MLPERF_LOADGEN_MODES,=performance --env.OUTPUT_BASE_DIR=/home/anandhu/runstest --env.CM_OUTPUT_FOLDER_NAME=test_results --add_deps_recursive.coco2014-original.tags=_full --add_deps_recursive.coco2014-preprocessed.tags=_full --add_deps_recursive.imagenet-original.tags=_full --add_deps_recursive.imagenet-preprocessed.tags=_full --add_deps_recursive.openimages-original.tags=_full --add_deps_recursive.openimages-preprocessed.tags=_full --add_deps_recursive.openorca-original.tags=_full --add_deps_recursive.openorca-preprocessed.tags=_full --add_deps_recursive.coco2014-dataset.tags=_full --add_deps_recursive.igbh-dataset.tags=_full --add_deps_recursive.get-mlperf-inference-results-dir.tags=_version.r5.0-dev --add_deps_recursive.get-mlperf-inference-submission-dir.tags=_version.r5.0-dev --add_deps_recursive.mlperf-inference-nvidia-scratch-space.tags=_version.r5.0-dev --add_deps_recursive.mlperf-inference-implementation.tags=_batch_size.256 --v=False --print_env=False --print_deps=False --dump_version_info=True --env.OUTPUT_BASE_DIR=/cm-mount/home/anandhu/runstest --env.CM_MLPERF_INFERENCE_SUBMISSION_DIR=/home/cmuser/CM/repos/local/cache/b6bd72defb5c4bc3/mlperf-inference-submission --env.CM_DATASET_IGBH_PATH=/cm-mount/data/common/anandhu/igbh && bash ) || bash'
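Because the flag appears twice, which value wins depends on how the automation parses repeated --env. options: a left-to-right parse that overwrites on each occurrence keeps the corrected second value (/cm-mount/...), while a parse that stops at the first match keeps the stale host path, which would explain the mismatch. A minimal illustration of last-assignment-wins semantics (a hypothetical parser, not the actual CM implementation):

```python
def parse_env_flags(argv):
    """Collect --env.KEY=VALUE flags; later occurrences overwrite earlier ones."""
    env = {}
    for arg in argv:
        if arg.startswith("--env.") and "=" in arg:
            key, _, value = arg[len("--env."):].partition("=")
            env[key] = value
    return env

# The two conflicting assignments from the docker run command above:
argv = [
    "--env.CM_DATASET_IGBH_PATH=/data/common/anandhu/igbh/",
    "--env.CM_DATASET_IGBH_PATH=/cm-mount/data/common/anandhu/igbh",
]
```

Under last-wins parsing the corrected mount path survives; the observed behavior suggests the first (host) value was the one propagated to the benchmark command.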

arjunsuresh added a commit to GATEOverflow/mlperf-automations that referenced this issue Dec 30, 2024
@arjunsuresh arjunsuresh self-assigned this Dec 30, 2024
@arjunsuresh arjunsuresh added the bug Something isn't working label Dec 30, 2024
arjunsuresh (Collaborator)

@anandhu-eng It should be resolved in this PR

arjunsuresh added a commit to GATEOverflow/mlperf-automations that referenced this issue Dec 30, 2024
Status: Done
2 participants