[Duplicate] "IndexAllocation-Reference closed" error on search workload #2344

owenhalpert · 2024-12-18T21:49:40Z

What is the bug?

When running the vectorsearch OpenSearch benchmark workload on a constrained system (3GB memory, single node), repeated index/search workloads will occasionally cause node drops.

Other times, without a node drop, I see the following error:

IndexAllocation-Reference error [ERROR] search_phase_execution_exception ({'error': {'root_cause': [{'type': 'illegal_state_exception', 'reason': "IndexAllocation-Reference is already closed can't increment refCount current count [0]"}], 'type': 'search_phase_execution_exception', 'reason': 'all shards failed', 'phase': 'query', 'grouped': True, 'failed_shards': [{'shard': 0, 'index': 'target_index', 'node': 'jgqwrThQTvuDDPQfxXZo_g', 'status': 500})

The workload succeeds with a small error rate.
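For anyone triaging client-side logs, a small check like the one below can separate this failure from other search errors. It is only a sketch: the function name `is_closed_allocation_error` is hypothetical, and it assumes the response body has already been parsed into a Python dict with the same shape as the error above.

```python
def is_closed_allocation_error(body: dict) -> bool:
    """Return True if a parsed error body reports the closed IndexAllocation reference.

    Hypothetical helper; the keys mirror the error payload quoted in this issue.
    """
    error = body.get("error")
    if not isinstance(error, dict):
        return False
    return any(
        cause.get("type") == "illegal_state_exception"
        and "IndexAllocation-Reference is already closed" in cause.get("reason", "")
        for cause in error.get("root_cause", [])
    )
```

Counting how many responses in a run satisfy this check gives the error rate attributable to this specific failure.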

How can one reproduce the bug?
Steps to reproduce the behavior:
Docker-compose.yml (based on the sample docker-compose.yml given in the OpenSearch docs, but with restricted memory):

services:
  opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/)
    image: opensearchproject/opensearch:2.18.0 # Specifying the latest available image - modify if you want a specific version
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.type=single-node
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
    deploy:
      resources:
        limits:
          memory: 3.0GB
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.18.0 # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes
    container_name: opensearch-dashboards
    deploy:
      resources:
        limits:
          memory: 500MB
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch-node1:9200"]' # Define the OpenSearch nodes that OpenSearch Dashboards will query
    networks:
      - opensearch-net

volumes:
  opensearch-data1:

networks:
  opensearch-net:

Params (faiss-sift-128-l2.json from the sample params with search_clients, id_field_name, and docvalue_fields added or updated):

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/faiss-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 128,
    "target_index_space_type": "l2",
    "search_clients":8,
    "id_field_name": "id",

    "target_index_bulk_size": 100,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_index_data_set_path": "sift-128-euclidean.hdf5",
    "target_index_bulk_indexing_clients": 10,

    "target_index_max_num_segments": 1,
    "target_index_force_merge_timeout": 300,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,

    "query_k": 100,
    "query_body": {
         "docvalue_fields" : ["id"],
         "stored_fields" : "_none_"
    },

    "query_data_set_format": "hdf5",
    "query_data_set_path":"sift-128-euclidean.hdf5",
    "query_count": 100
  }

Commands:

docker-compose up -d

curl -k -X PUT "https://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -u 'admin:admin' -d '{
  "persistent": {
    "knn.cache.item.expiry.minutes": "1m"
  }
}'

# Index, search, index, search (--workload-params is the path to the params file)
opensearch-benchmark execute-test \
--workload=vectorsearch \
--target-hosts=https://localhost:9200 \
--client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false \
--workload-params=faiss-sift-128-l2.json \
--test-procedure=no-train-test-index-only \
--pipeline=benchmark-only \
--kill-running-processes

opensearch-benchmark execute-test \
--workload=vectorsearch \
--target-hosts=https://localhost:9200 \
--client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false \
--workload-params=faiss-sift-128-l2.json \
--test-procedure=search-only \
--pipeline=benchmark-only \
--kill-running-processes

opensearch-benchmark execute-test \
--workload=vectorsearch \
--target-hosts=https://localhost:9200 \
--client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false \
--workload-params=faiss-sift-128-l2.json \
--test-procedure=no-train-test-index-only \
--pipeline=benchmark-only \
--kill-running-processes

opensearch-benchmark execute-test \
--workload=vectorsearch \
--target-hosts=https://localhost:9200 \
--client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false \
--workload-params=faiss-sift-128-l2.json \
--test-procedure=search-only \
--pipeline=benchmark-only \
--kill-running-processes
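Because the failure is intermittent, it can help to script the index/search cycle instead of pasting the four commands by hand. The sketch below only reuses the flags from the commands above; the helper names (`build_command`, `build_cycle`, `run_cycle`) are hypothetical, and it assumes `opensearch-benchmark` is on the PATH and the params file is in the working directory.

```python
import subprocess

# Flags shared by every run, copied from the commands above.
BASE_ARGS = [
    "opensearch-benchmark",
    "execute-test",
    "--workload=vectorsearch",
    "--target-hosts=https://localhost:9200",
    "--client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false",
    "--workload-params=faiss-sift-128-l2.json",
    "--pipeline=benchmark-only",
    "--kill-running-processes",
]


def build_command(test_procedure: str) -> list[str]:
    """Return the argument list for one benchmark run."""
    return BASE_ARGS + [f"--test-procedure={test_procedure}"]


def build_cycle(rounds: int) -> list[list[str]]:
    """Alternate index and search runs, `rounds` times."""
    procedures = ["no-train-test-index-only", "search-only"] * rounds
    return [build_command(procedure) for procedure in procedures]


def run_cycle(rounds: int = 2) -> None:
    """Run the cycle; a failing run does not stop the remaining ones."""
    for command in build_cycle(rounds):
        subprocess.run(command, check=False)
```

Calling `run_cycle(2)` reproduces the index, search, index, search sequence; a larger `rounds` value gives the intermittent error more chances to appear.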

What is the expected behavior?
Queries complete without error, or the node drops.

What is your host/environment?

OS: Apple M3 Pro, Sequoia
Version: 2.18
Plugins: k-NN

Do you have any additional context?

This is fairly inconsistent: the error occurs about 50% of the time.
OpenSearch code pointer: https://github.com/owenhalpert/OpenSearch/blob/c557f2717ad45627cacd88e8243893dd84a56623/libs/common/src/main/java/org/opensearch/common/util/concurrent/AbstractRefCounted.java#L85
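To make the error message easier to read, here is a simplified Python model of a reference-counted resource. It is loosely based on my reading of the class linked above and is not the OpenSearch implementation; the class `RefCounted` and its method names are hypothetical. The sequence at the bottom (last reference released, then a search tries to take a new one) is one plausible path to the error, not a confirmed root cause.

```python
import threading


class RefCounted:
    """Toy reference counter (hypothetical, not OpenSearch code)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._count = 1  # the creator holds the initial reference
        self._lock = threading.Lock()

    def try_inc_ref(self) -> bool:
        """Take a reference unless the count has already reached zero."""
        with self._lock:
            if self._count == 0:
                return False
            self._count += 1
            return True

    def inc_ref(self) -> None:
        """Take a reference or fail loudly if the resource is closed."""
        if not self.try_inc_ref():
            raise RuntimeError(
                f"{self.name} is already closed can't increment refCount current count [0]"
            )

    def dec_ref(self) -> None:
        """Release one reference; at zero the resource counts as closed."""
        with self._lock:
            if self._count == 0:
                raise RuntimeError(f"{self.name} is already closed")
            self._count -= 1


allocation = RefCounted("IndexAllocation-Reference")
allocation.dec_ref()  # assumption: something releases the last reference
try:
    allocation.inc_ref()  # a later search cannot revive the closed resource
except RuntimeError as err:
    message = str(err)
    print(message)
```

The point of the model is that once the count reaches zero it never goes back up, so any caller still holding a stale handle fails on its next `inc_ref`.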


navneet1v · 2024-12-24T19:34:17Z

Similar issue reported here: #2262

navneet1v · 2024-12-24T19:47:18Z

@Gankris96 can you please take a look

owenhalpert added the bug and untriaged labels Dec 18, 2024

navneet1v removed the untriaged label Dec 24, 2024

navneet1v added this to Vector Search RoadMap Dec 24, 2024

github-project-automation bot moved this to Backlog in Vector Search RoadMap Dec 24, 2024

navneet1v moved this from Backlog to 2.19.0 in Vector Search RoadMap Dec 24, 2024

navneet1v added the duplicate label and removed the bug label Dec 24, 2024

navneet1v changed the title ~~[BUG] "IndexAllocation-Reference closed" error on search workload~~ [Duplicate] "IndexAllocation-Reference closed" error on search workload Dec 24, 2024
