First we get an error with `single positional indexer is out-of-bounds`:

```
Deployed LocalCUDACluster(d47d456d, 'tcp://127.0.0.1:45241', workers=2, threads=2, memory=124.43 GiB)... [0/1343]
Downloading hotpotqa ...
/root/.cf/hotpotqa.zip: 100%|█████████████████| 624M/624M [00:36<00:00, 17.9MiB/s]
Unzipping hotpotqa ...
2023-10-19 22:34:14,678 - distributed.sizeof - WARNING - Sizeof calculation failed. Defaulting to 0.95 MiB
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/distributed/sizeof.py", line 17, in safe_sizeof
return sizeof(obj)
File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
return meth(arg, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
return sys.getsizeof(seq) + sum(map(sizeof, seq))
File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
return meth(arg, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
return sys.getsizeof(seq) + sum(map(sizeof, seq))
File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
return meth(arg, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
return sys.getsizeof(seq) + sum(map(sizeof, seq))
File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
return meth(arg, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
return sys.getsizeof(seq) + sum(map(sizeof, seq))
File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
return meth(arg, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
return sys.getsizeof(seq) + sum(map(sizeof, seq))
File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
return meth(arg, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/dask_cudf/backends.py", line 465, in sizeof_cudf_dataframe
sum(col.memory_usage for col in df._data.columns)
File "/usr/local/lib/python3.10/dist-packages/dask_cudf/backends.py", line 465, in <genexpr>
sum(col.memory_usage for col in df._data.columns)
File "/usr/lib/python3.10/functools.py", line 981, in __get__
val = self.func(instance)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/struct.py", line 80, in memory_usage
n += child.memory_usage
File "/usr/lib/python3.10/functools.py", line 981, in __get__
val = self.func(instance)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/lists.py", line 77, in memory_usage
].element_indexing(current_offset)
File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 548, in element_indexing
raise IndexError("single positional indexer is out-of-bounds")
IndexError: single positional indexer is out-of-bounds
/usr/local/lib/python3.10/dist-packages/cudf/io/json.py:229: UserWarning: Using CPU via Pandas to write JSON dataset
warnings.warn("Using CPU via Pandas to write JSON dataset")
/usr/local/lib/python3.10/dist-packages/cudf/io/json.py:229: UserWarning: Using CPU via Pandas to write JSON dataset
warnings.warn("Using CPU via Pandas to write JSON dataset")
```
Even with this error, the process continues. With `tiny_sample=False`, the evaluation completes and we get a BEIR report with the metrics at the end. However, with `tiny_sample=True`, we get a subsequent error at the end of the evaluation. Perhaps it's related to the first error?
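For context, here is a minimal sketch of the kind of script being run, reconstructed from the traceback and logs above (`beir_metrics.py` calling `cf.beir_report` on `hotpotqa` under a `LocalCUDACluster`). The model argument and the `client=` keyword are assumptions, not the exact invocation:

```python
# Hypothetical reconstruction of beir_metrics.py based on the traceback above.
# Only cf.beir_report, the hotpotqa dataset, LocalCUDACluster, and the
# tiny_sample flag come from the logs; the remaining arguments are assumed.
from dask_cuda import LocalCUDACluster
from distributed import Client

import crossfit as cf

if __name__ == "__main__":
    cluster = LocalCUDACluster()  # the log above shows 2 workers / 2 threads
    client = Client(cluster)

    report = cf.beir_report(
        "hotpotqa",                                      # dataset downloaded in the log
        model="sentence-transformers/all-MiniLM-L6-v2",  # placeholder; actual model unknown
        client=client,                                   # assumed parameter name
        tiny_sample=True,                                 # False completes; True hits the second error below
    )
    print(report)
```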
Second error:

```
Doing vector search: TorchExactSearch...█▉| 52992/53125 [00:05<00:00, 5876.70it/s]
2023-10-19 22:43:28,679 - distributed.worker - WARNING - Compute Failed
Key: call_part-20d4cc5a-2b7e-4a47-885b-c22dbe5e9366
Function: call_part
args: ( index ... embedding
level_0 ...
0 0 ... [0.04964268, -0.0006760926, -0.05950017, -0.03...
1 1 ... [0.050978776, -0.020911103, 0.033033445, -0.01...
2 2 ... [0.0049503874, -0.023904875, -0.08124614, -0.0...
3 3 ... [-0.026340026, 0.03371759, -0.04466627, -0.040...
4 4 ... [-0.03982742, -0.01967655, -0.105481476, 0.012...
... ... ... ...
48921 48921 ... [0.0070217294, 0.066551566, -0.019821709, 0.02...
48922 48922 ... [0.06491591, 0.030879924, -0.024998112, -0.026...
48923 48923 ... [0.069637045, 0.08056582, -0.06847206, 0.03625...
48924 48924 ... [-0.09423433, 0.09887364, -0.013571036, 0.0046...
48925 48925 ... [0.0072488617, 0.022699378, -0.032196615, 0.02...
[48926 rows x 3 columns], index _id
kwargs: {}
Exception: "RuntimeError('The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.')"
Traceback (most recent call last):
File "/code/crossfit/beir_metrics.py", line 16, in <module>
report = cf.beir_report(
File "/code/crossfit/crossfit/report/beir/report.py", line 182, in beir_report
embeddings: EmbeddingDatataset = embed(
File "/code/crossfit/crossfit/report/beir/embed.py", line 72, in embed
topk_df.to_parquet(pred_path)
File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
result = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/dask_cudf/core.py", line 252, in to_parquet
return to_parquet(self, path, *args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/parquet/core.py", line 1062, in to_parquet
out = out.compute(**compute_kwargs)
File "/code/crossfit/crossfit/op/vector_search.py", line 46, in call_part
results, indices = self.search_tensors(query_emb, item_emb)
File "/code/crossfit/crossfit/backend/torch/op/vector_search.py", line 31, in search_tensors
sim_scores = score_function(queries, corpus)
File "/usr/local/lib/python3.10/dist-packages/beir/retrieval/search/dense/util.py", line 24, in cos_sim
return torch.mm(a_norm, b_norm.transpose(0, 1)) #TODO: this keeps allocating GPU memory
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
2023-10-19 22:43:28,813 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/distributed/comm/tcp.py", line 225, in read
frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/distributed/worker.py", line 1255, in heartbeat
response = await retry_operation(
File "/usr/local/lib/python3.10/dist-packages/distributed/utils_comm.py", line 434, in retry_operation
return await retry(
File "/usr/local/lib/python3.10/dist-packages/distributed/utils_comm.py", line 413, in retry
return await coro()
File "/usr/local/lib/python3.10/dist-packages/distributed/core.py", line 1377, in send_recv_from_rpc
return await send_recv(comm=comm, op=key, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/distributed/core.py", line 1136, in send_recv
response = await comm.read(deserializers=deserializers)
File "/usr/local/lib/python3.10/dist-packages/distributed/comm/tcp.py", line 241, in read
convert_stream_closed_error(self, e)
File "/usr/local/lib/python3.10/dist-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:35924 remote=tcp://127.0.0.1:39663>: Stream is closed
```
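To help isolate whether the second error is caused by the first, one hedged idea (an assumption, not a confirmed fix) is to bypass the failing size estimate. The IndexError is raised while Dask's `sizeof` dispatch walks `memory_usage` on a struct/list column of a cudf DataFrame, so re-registering a more defensive `sizeof` for `cudf.DataFrame` would skip that path:

```python
# Hedged workaround sketch, not a fix for the underlying cudf bug: override the
# cudf.DataFrame sizeof registered by dask_cudf with one that tolerates the
# failing struct/list-column memory_usage, so distributed's size estimate does
# not raise. With a multi-process LocalCUDACluster this registration would also
# need to run on every worker (e.g. via client.run).
import cudf
from dask.sizeof import sizeof


@sizeof.register(cudf.DataFrame)
def _sizeof_cudf_dataframe(df):
    try:
        # Per-column byte counts; may raise the same IndexError on struct columns
        return int(df.memory_usage(deep=True).sum())
    except IndexError:
        # Fall back to a rough fixed estimate so Dask can keep scheduling
        return 1_000_000
```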