single positional indexer is out-of-bounds error with hotpotqa #13

Open
edknv opened this issue Oct 19, 2023 · 0 comments
edknv commented Oct 19, 2023

Running the following produces an error:

import crossfit as cf


if __name__ == "__main__":

    torch_mem = 40
    model_name = "all-MiniLM-L6-v2"
    dataset = "hotpotqa"
    top_k = 100


    model = cf.SentenceTransformerModel(model_name, max_mem_gb=torch_mem)
    vector_search = cf.TorchExactSearch(k=top_k)

    with cf.Distributed(rmm_pool_size=f"{torch_mem}GB", n_workers=2):
        report = cf.beir_report(
            dataset,
            model=model,
            vector_search=vector_search,
            sorted_data_loader=True,
            #tiny_sample=True,
            overwrite=True,
        )

    report.console()

First, we get a single positional indexer is out-of-bounds error:

Deployed LocalCUDACluster(d47d456d, 'tcp://127.0.0.1:45241', workers=2, threads=2, memory=124.43 GiB)...
Downloading hotpotqa ...
/root/.cf/hotpotqa.zip: 100%|█████████████████| 624M/624M [00:36<00:00, 17.9MiB/s] 
Unzipping hotpotqa ...
2023-10-19 22:34:14,678 - distributed.sizeof - WARNING - Sizeof calculation failed. Defaulting to 0.95 MiB
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/distributed/sizeof.py", line 17, in safe_sizeof
    return sizeof(obj)
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner  
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask_cudf/backends.py", line 465, in sizeof_cudf_dataframe
    sum(col.memory_usage for col in df._data.columns)                              
  File "/usr/local/lib/python3.10/dist-packages/dask_cudf/backends.py", line 465, in <genexpr>
    sum(col.memory_usage for col in df._data.columns)                              
  File "/usr/lib/python3.10/functools.py", line 981, in __get__                    
    val = self.func(instance)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/struct.py", line 80, in memory_usage
    n += child.memory_usage
  File "/usr/lib/python3.10/functools.py", line 981, in __get__                    
    val = self.func(instance)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/lists.py", line 77, in memory_usage
    ].element_indexing(current_offset)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 548, in element_indexing
    raise IndexError("single positional indexer is out-of-bounds")                 
IndexError: single positional indexer is out-of-bounds                             
/usr/local/lib/python3.10/dist-packages/cudf/io/json.py:229: UserWarning: Using CPU via Pandas to write JSON dataset
  warnings.warn("Using CPU via Pandas to write JSON dataset")                      
/usr/local/lib/python3.10/dist-packages/cudf/io/json.py:229: UserWarning: Using CPU via Pandas to write JSON dataset
  warnings.warn("Using CPU via Pandas to write JSON dataset")                      
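
For context, the traceback suggests that Dask's sizeof calculation recurses into a cuDF struct column whose list child ends up calling element_indexing with an out-of-range offset. Since distributed only logs this as a warning and falls back to a default size (0.95 MiB), it may be benign on its own. As a purely hypothetical probe (the struct-of-list layout below is an assumption; I don't know the exact column layout of the hotpotqa partitions), something like this mirrors the failing call from dask_cudf/backends.py and may or may not reproduce the IndexError depending on the cudf version:

import cudf

# Hypothetical nested layout; the real hotpotqa partition layout is an assumption.
df = cudf.DataFrame({"meta": [{"texts": ["a", "b"]}, {"texts": []}]})

try:
    # Mirrors the sizeof path from dask_cudf/backends.py in the traceback above.
    total = sum(col.memory_usage for col in df._data.columns)
    print("memory_usage succeeded:", total)
except IndexError as e:
    print("reproduced the sizeof failure:", e)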

Even with this error, the process continues; with tiny_sample=False, the evaluation completes and we get a BEIR report with the metrics at the end. However, with tiny_sample=True, we get a subsequent error at the end of the evaluation. Perhaps it's related to the first error?

Second error:

Doing vector search: TorchExactSearch...█▉| 52992/53125 [00:05<00:00, 5876.70it/s]
2023-10-19 22:43:28,679 - distributed.worker - WARNING - Compute Failed
Key:       call_part-20d4cc5a-2b7e-4a47-885b-c22dbe5e9366
Function:  call_part
args:      (         index  ...                                          embedding
level_0         ...                                                   
0            0  ...  [0.04964268, -0.0006760926, -0.05950017, -0.03...
1            1  ...  [0.050978776, -0.020911103, 0.033033445, -0.01...
2            2  ...  [0.0049503874, -0.023904875, -0.08124614, -0.0...
3            3  ...  [-0.026340026, 0.03371759, -0.04466627, -0.040...
4            4  ...  [-0.03982742, -0.01967655, -0.105481476, 0.012...
...        ...  ...                                                ...
48921    48921  ...  [0.0070217294, 0.066551566, -0.019821709, 0.02...
48922    48922  ...  [0.06491591, 0.030879924, -0.024998112, -0.026...
48923    48923  ...  [0.069637045, 0.08056582, -0.06847206, 0.03625...
48924    48924  ...  [-0.09423433, 0.09887364, -0.013571036, 0.0046...
48925    48925  ...  [0.0072488617, 0.022699378, -0.032196615, 0.02...

[48926 rows x 3 columns],            index       _id                       
kwargs:    {}
Exception: "RuntimeError('The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.')"

Traceback (most recent call last):
  File "/code/crossfit/beir_metrics.py", line 16, in <module>
    report = cf.beir_report(
  File "/code/crossfit/crossfit/report/beir/report.py", line 182, in beir_report
    embeddings: EmbeddingDatataset = embed(
  File "/code/crossfit/crossfit/report/beir/embed.py", line 72, in embed
    topk_df.to_parquet(pred_path)
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask_cudf/core.py", line 252, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/parquet/core.py", line 1062, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/code/crossfit/crossfit/op/vector_search.py", line 46, in call_part
    results, indices = self.search_tensors(query_emb, item_emb)
  File "/code/crossfit/crossfit/backend/torch/op/vector_search.py", line 31, in search_tensors
    sim_scores = score_function(queries, corpus)
  File "/usr/local/lib/python3.10/dist-packages/beir/retrieval/search/dense/util.py", line 24, in cos_sim
    return torch.mm(a_norm, b_norm.transpose(0, 1)) #TODO: this keeps allocating GPU memory
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
2023-10-19 22:43:28,813 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/distributed/worker.py", line 1255, in heartbeat
    response = await retry_operation(
  File "/usr/local/lib/python3.10/dist-packages/distributed/utils_comm.py", line 434, in retry_operation
    return await retry(
  File "/usr/local/lib/python3.10/dist-packages/distributed/utils_comm.py", line 413, in retry
    return await coro()
  File "/usr/local/lib/python3.10/dist-packages/distributed/core.py", line 1377, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/distributed/core.py", line 1136, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/usr/local/lib/python3.10/dist-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/usr/local/lib/python3.10/dist-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:35924 remote=tcp://127.0.0.1:39663>: Stream is closed
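
The RuntimeError comes from torch.mm being handed a tensor whose storage was never materialized, which would be consistent with an empty query or corpus partition reaching search_tensors; the heartbeat CommClosedError above looks like fallout from the failed task rather than a separate problem. As a hedged workaround sketch (safe_cos_sim is a hypothetical helper, not part of crossfit or beir), one could guard against empty inputs before computing similarities:

import torch
import torch.nn.functional as F

def safe_cos_sim(queries: torch.Tensor, corpus: torch.Tensor) -> torch.Tensor:
    # Hypothetical guard: if either side has no elements, return an empty score
    # matrix instead of calling torch.mm on unallocated storage.
    if queries.numel() == 0 or corpus.numel() == 0:
        return torch.empty(queries.shape[0], corpus.shape[0], device=queries.device)
    a_norm = F.normalize(queries, p=2, dim=1)
    b_norm = F.normalize(corpus, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))

Whether an empty partition is actually what reaches search_tensors here is only a guess on my part.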
