single positional indexer is out-of-bounds error with hotpotqa #13

Open
edknv opened this issue Oct 19, 2023 · 0 comments
edknv commented Oct 19, 2023

Running the following produces an error:

import crossfit as cf


if __name__ == "__main__":

    torch_mem = 40
    model_name = "all-MiniLM-L6-v2"
    dataset = "hotpotqa"
    top_k = 100


    model = cf.SentenceTransformerModel(model_name, max_mem_gb=torch_mem)
    vector_search = cf.TorchExactSearch(k=top_k)

    with cf.Distributed(rmm_pool_size=f"{torch_mem}GB", n_workers=2):
        report = cf.beir_report(
            dataset,
            model=model,
            vector_search=vector_search,
            sorted_data_loader=True,
            #tiny_sample=True,
            overwrite=True,
        )

    report.console()

First, we get a single positional indexer is out-of-bounds error:

Deployed LocalCUDACluster(d47d456d, 'tcp://127.0.0.1:45241', workers=2, threads=2, memory=124.43 GiB)...
Downloading hotpotqa ...
/root/.cf/hotpotqa.zip: 100%|█████████████████| 624M/624M [00:36<00:00, 17.9MiB/s] 
Unzipping hotpotqa ...
2023-10-19 22:34:14,678 - distributed.sizeof - WARNING - Sizeof calculation failed. Defaulting to 0.95 MiB
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/distributed/sizeof.py", line 17, in safe_sizeof
    return sizeof(obj)
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/sizeof.py", line 58, in sizeof_python_collection
    return sys.getsizeof(seq) + sum(map(sizeof, seq))                              
  File "/usr/local/lib/python3.10/dist-packages/dask/utils.py", line 642, in __call__
    return meth(arg, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner  
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask_cudf/backends.py", line 465, in sizeof_cudf_dataframe
    sum(col.memory_usage for col in df._data.columns)                              
  File "/usr/local/lib/python3.10/dist-packages/dask_cudf/backends.py", line 465, in <genexpr>
    sum(col.memory_usage for col in df._data.columns)                              
  File "/usr/lib/python3.10/functools.py", line 981, in __get__                    
    val = self.func(instance)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/struct.py", line 80, in memory_usage
    n += child.memory_usage
  File "/usr/lib/python3.10/functools.py", line 981, in __get__                    
    val = self.func(instance)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/lists.py", line 77, in memory_usage
    ].element_indexing(current_offset)
  File "/usr/local/lib/python3.10/dist-packages/cudf/core/column/column.py", line 548, in element_indexing
    raise IndexError("single positional indexer is out-of-bounds")                 
IndexError: single positional indexer is out-of-bounds                             
/usr/local/lib/python3.10/dist-packages/cudf/io/json.py:229: UserWarning: Using CPU via Pandas to write JSON dataset
  warnings.warn("Using CPU via Pandas to write JSON dataset")                      
/usr/local/lib/python3.10/dist-packages/cudf/io/json.py:229: UserWarning: Using CPU via Pandas to write JSON dataset
  warnings.warn("Using CPU via Pandas to write JSON dataset")                      
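
For context, the traceback suggests that Dask's sizeof calculation recurses into a cuDF struct column whose list child ends up calling element_indexing with an out-of-range offset. Since distributed only logs this as a warning and falls back to a default size (0.95 MiB), it may be benign on its own. As a purely hypothetical probe (the struct-of-list layout below is an assumption; I don't know the exact column layout of the hotpotqa partitions), something like this mirrors the failing call from dask_cudf/backends.py and may or may not reproduce the IndexError depending on the cudf version:

import cudf

# Hypothetical nested layout; the real hotpotqa partition layout is an assumption.
df = cudf.DataFrame({"meta": [{"texts": ["a", "b"]}, {"texts": []}]})

try:
    # Mirrors the sizeof path from dask_cudf/backends.py in the traceback above.
    total = sum(col.memory_usage for col in df._data.columns)
    print("memory_usage succeeded:", total)
except IndexError as e:
    print("reproduced the sizeof failure:", e)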

Even with this error, the process continues; with tiny_sample=False, the evaluation completes and we get a BEIR report with the metrics at the end. However, with tiny_sample=True, we get a subsequent error at the end of the evaluation. Perhaps it's related to the first error?

Second error:

Doing vector search: TorchExactSearch...█▉| 52992/53125 [00:05<00:00, 5876.70it/s]
2023-10-19 22:43:28,679 - distributed.worker - WARNING - Compute Failed
Key:       call_part-20d4cc5a-2b7e-4a47-885b-c22dbe5e9366
Function:  call_part
args:      (         index  ...                                          embedding
level_0         ...                                                   
0            0  ...  [0.04964268, -0.0006760926, -0.05950017, -0.03...
1            1  ...  [0.050978776, -0.020911103, 0.033033445, -0.01...
2            2  ...  [0.0049503874, -0.023904875, -0.08124614, -0.0...
3            3  ...  [-0.026340026, 0.03371759, -0.04466627, -0.040...
4            4  ...  [-0.03982742, -0.01967655, -0.105481476, 0.012...
...        ...  ...                                                ...
48921    48921  ...  [0.0070217294, 0.066551566, -0.019821709, 0.02...
48922    48922  ...  [0.06491591, 0.030879924, -0.024998112, -0.026...
48923    48923  ...  [0.069637045, 0.08056582, -0.06847206, 0.03625...
48924    48924  ...  [-0.09423433, 0.09887364, -0.013571036, 0.0046...
48925    48925  ...  [0.0072488617, 0.022699378, -0.032196615, 0.02...

[48926 rows x 3 columns],            index       _id                       
kwargs:    {}
Exception: "RuntimeError('The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.')"

Traceback (most recent call last):
  File "/code/crossfit/beir_metrics.py", line 16, in <module>
    report = cf.beir_report(
  File "/code/crossfit/crossfit/report/beir/report.py", line 182, in beir_report
    embeddings: EmbeddingDatataset = embed(
  File "/code/crossfit/crossfit/report/beir/embed.py", line 72, in embed
    topk_df.to_parquet(pred_path)
  File "/usr/local/lib/python3.10/dist-packages/nvtx/nvtx.py", line 101, in inner
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask_cudf/core.py", line 252, in to_parquet
    return to_parquet(self, path, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/dask/dataframe/io/parquet/core.py", line 1062, in to_parquet
    out = out.compute(**compute_kwargs)
  File "/code/crossfit/crossfit/op/vector_search.py", line 46, in call_part
    results, indices = self.search_tensors(query_emb, item_emb)
  File "/code/crossfit/crossfit/backend/torch/op/vector_search.py", line 31, in search_tensors
    sim_scores = score_function(queries, corpus)
  File "/usr/local/lib/python3.10/dist-packages/beir/retrieval/search/dense/util.py", line 24, in cos_sim
    return torch.mm(a_norm, b_norm.transpose(0, 1)) #TODO: this keeps allocating GPU memory
RuntimeError: The tensor has a non-zero number of elements, but its data is not allocated yet. Caffe2 uses a lazy allocation, so you will need to call mutable_data() or raw_mutable_data() to actually allocate memory.
2023-10-19 22:43:28,813 - distributed.worker - ERROR - Failed to communicate with scheduler during heartbeat.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/distributed/comm/tcp.py", line 225, in read
    frames_nbytes = await stream.read_bytes(fmt_size)
tornado.iostream.StreamClosedError: Stream is closed

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/distributed/worker.py", line 1255, in heartbeat
    response = await retry_operation(
  File "/usr/local/lib/python3.10/dist-packages/distributed/utils_comm.py", line 434, in retry_operation
    return await retry(
  File "/usr/local/lib/python3.10/dist-packages/distributed/utils_comm.py", line 413, in retry
    return await coro()
  File "/usr/local/lib/python3.10/dist-packages/distributed/core.py", line 1377, in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/distributed/core.py", line 1136, in send_recv
    response = await comm.read(deserializers=deserializers)
  File "/usr/local/lib/python3.10/dist-packages/distributed/comm/tcp.py", line 241, in read
    convert_stream_closed_error(self, e)
  File "/usr/local/lib/python3.10/dist-packages/distributed/comm/tcp.py", line 144, in convert_stream_closed_error
    raise CommClosedError(f"in {obj}: {exc}") from exc
distributed.comm.core.CommClosedError: in <TCP (closed) ConnectionPool.heartbeat_worker local=tcp://127.0.0.1:35924 remote=tcp://127.0.0.1:39663>: Stream is closed
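
The RuntimeError comes from torch.mm being handed a tensor whose storage was never materialized, which would be consistent with an empty query or corpus partition reaching search_tensors; the heartbeat CommClosedError above looks like fallout from the failed task rather than a separate problem. As a hedged workaround sketch (safe_cos_sim is a hypothetical helper, not part of crossfit or beir), one could guard against empty inputs before computing similarities:

import torch
import torch.nn.functional as F

def safe_cos_sim(queries: torch.Tensor, corpus: torch.Tensor) -> torch.Tensor:
    # Hypothetical guard: if either side has no elements, return an empty score
    # matrix instead of calling torch.mm on unallocated storage.
    if queries.numel() == 0 or corpus.numel() == 0:
        return torch.empty(queries.shape[0], corpus.shape[0], device=queries.device)
    a_norm = F.normalize(queries, p=2, dim=1)
    b_norm = F.normalize(corpus, p=2, dim=1)
    return torch.mm(a_norm, b_norm.transpose(0, 1))

Whether an empty partition is actually what reaches search_tensors here is only a guess on my part.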
