[Duplicate] "IndexAllocation-Reference closed" error on search workload #2344

Open
owenhalpert opened this issue Dec 18, 2024 · 2 comments
Labels: duplicate (This issue or pull request already exists)

owenhalpert commented Dec 18, 2024

What is the bug?

When running the vectorsearch workload from OpenSearch Benchmark on a constrained system (single node, 3GB memory), repeated index/search runs occasionally cause the node to drop.

Other times I see the following IndexAllocation-Reference error, without a node drop:

[ERROR] search_phase_execution_exception ({'error': {'root_cause': [{'type': 'illegal_state_exception', 'reason': "IndexAllocation-Reference is already closed can't increment refCount current count [0]"}], 'type': 'search_phase_execution_exception', 'reason': 'all shards failed', 'phase': 'query', 'grouped': True, 'failed_shards': [{'shard': 0, 'index': 'target_index', 'node': 'jgqwrThQTvuDDPQfxXZo_g', 'status': 500})

In that case the workload still succeeds, with a small error rate.
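To count how many failed requests are this specific error rather than some other shard failure, a response body can be checked for the reason string. The following is a minimal sketch; the helper name `is_closed_allocation_error` and the `MARKER` constant are hypothetical, and the sample body is a trimmed copy of the error quoted above.

```python
# Hypothetical helper, not part of OpenSearch or OpenSearch Benchmark.
MARKER = "IndexAllocation-Reference is already closed"


def is_closed_allocation_error(body):
    """Return True if any root cause is the closed-reference error."""
    error = body.get("error")
    if not isinstance(error, dict):
        # Some error responses carry a plain string instead of an object.
        return False
    return any(
        cause.get("type") == "illegal_state_exception"
        and MARKER in cause.get("reason", "")
        for cause in error.get("root_cause", [])
    )


# Trimmed copy of the error body from this report.
sample = {
    "error": {
        "root_cause": [
            {
                "type": "illegal_state_exception",
                "reason": "IndexAllocation-Reference is already closed "
                          "can't increment refCount current count [0]",
            }
        ],
        "type": "search_phase_execution_exception",
        "reason": "all shards failed",
        "phase": "query",
    }
}

print(is_closed_allocation_error(sample))
```

Matching on the root cause rather than on the top-level `search_phase_execution_exception` keeps unrelated "all shards failed" errors out of the count.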

How can one reproduce the bug?
Steps to reproduce the behavior:
Docker-compose.yml (based on the sample docker-compose.yml given in the OpenSearch docs, but with restricted memory):

services:
  opensearch-node1: # This is also the hostname of the container within the Docker network (i.e. https://opensearch-node1/)
    image: opensearchproject/opensearch:2.18.0 # Pinned to 2.18.0 - modify if you want a different version
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.type=single-node
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
      - OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
    deploy:
      resources:
        limits:
          memory: 3.0GB
    ulimits:
      memlock:
        soft: -1 # Set memlock to unlimited (no soft or hard limit)
        hard: -1
      nofile:
        soft: 65536 # Maximum number of open files for the opensearch user - set to at least 65536
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data # Creates volume called opensearch-data1 and mounts it to the container
    ports:
      - 9200:9200 # REST API
      - 9600:9600 # Performance Analyzer
    networks:
      - opensearch-net # All of the containers will join the same Docker bridge network
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:2.18.0 # Make sure the version of opensearch-dashboards matches the version of opensearch installed on other nodes
    container_name: opensearch-dashboards
    deploy:
      resources:
        limits:
          memory: 500MB
    ports:
      - 5601:5601 # Map host port 5601 to container port 5601
    expose:
      - "5601" # Expose port 5601 for web access to OpenSearch Dashboards
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch-node1:9200"]' # Define the OpenSearch nodes that OpenSearch Dashboards will query
    networks:
      - opensearch-net

volumes:
  opensearch-data1:

networks:
  opensearch-net:

Params (faiss-sift-128-l2.json from the sample params with search_clients, id_field_name, and docvalue_fields added or updated):

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/faiss-index.json",
    "target_index_primary_shards": 1,
    "target_index_dimension": 128,
    "target_index_space_type": "l2",
    "search_clients": 8,
    "id_field_name": "id",

    "target_index_bulk_size": 100,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_index_data_set_path": "sift-128-euclidean.hdf5",
    "target_index_bulk_indexing_clients": 10,

    "target_index_max_num_segments": 1,
    "target_index_force_merge_timeout": 300,
    "hnsw_ef_search": 100,
    "hnsw_ef_construction": 100,

    "query_k": 100,
    "query_body": {
         "docvalue_fields" : ["id"],
         "stored_fields" : "_none_"
    },

    "query_data_set_format": "hdf5",
    "query_data_set_path": "sift-128-euclidean.hdf5",
    "query_count": 100
}

Commands:

docker-compose up -d

curl -k -X PUT "https://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -u 'admin:admin' -d '{
  "persistent": {
    "knn.cache.item.expiry.minutes": "1m"
  }
}'

# Index, search, index, search (--workload-params takes the path to the params file)
opensearch-benchmark execute-test \
--workload=vectorsearch \
--target-hosts=https://localhost:9200 \
--client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false \
--workload-params=faiss-sift-128-l2.json \
--test-procedure=no-train-test-index-only \
--pipeline=benchmark-only \
--kill-running-processes

opensearch-benchmark execute-test \
--workload=vectorsearch \
--target-hosts=https://localhost:9200 \
--client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false \
--workload-params=faiss-sift-128-l2.json \
--test-procedure=search-only \
--pipeline=benchmark-only \
--kill-running-processes

opensearch-benchmark execute-test \
--workload=vectorsearch \
--target-hosts=https://localhost:9200 \
--client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false \
--workload-params=faiss-sift-128-l2.json \
--test-procedure=no-train-test-index-only \
--pipeline=benchmark-only \
--kill-running-processes

opensearch-benchmark execute-test \
--workload=vectorsearch \
--target-hosts=https://localhost:9200 \
--client-options=basic_auth_user:admin,basic_auth_password:admin,verify_certs:false \
--workload-params=faiss-sift-128-l2.json \
--test-procedure=search-only \
--pipeline=benchmark-only \
--kill-running-processes
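
Since the error is intermittent, it may help to script the index/search cycle instead of pasting the four commands by hand. The sketch below only builds the argument lists and prints them; it uses the same flags and values as the commands above. The names `build_command` and `PROCEDURES` are hypothetical, and the params path is assumed to be relative to the working directory.

```python
import shlex

# Index, search, index, search: the same order as the manual commands.
PROCEDURES = ["no-train-test-index-only", "search-only"] * 2


def build_command(test_procedure, params_path="faiss-sift-128-l2.json"):
    """Build one opensearch-benchmark invocation as an argument list."""
    return [
        "opensearch-benchmark",
        "execute-test",
        "--workload=vectorsearch",
        "--target-hosts=https://localhost:9200",
        "--client-options=basic_auth_user:admin,"
        "basic_auth_password:admin,verify_certs:false",
        f"--workload-params={params_path}",
        f"--test-procedure={test_procedure}",
        "--pipeline=benchmark-only",
        "--kill-running-processes",
    ]


commands = [build_command(procedure) for procedure in PROCEDURES]
for command in commands:
    # Print a shell-quoted form for inspection before running anything.
    print(shlex.join(command))
```

To execute the cycle, each list can be passed to `subprocess.run(command, check=False)`; leaving `check` off lets the loop continue after a run that reports errors.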

What is the expected behavior?
Queries should either complete without error or cause the node to drop; they should not fail with this error while the node stays up.

What is your host/environment?

  • OS: macOS Sequoia (Apple M3 Pro)
  • Version: 2.18
  • Plugins: k-NN

Do you have any additional context?

  1. This is fairly inconsistent: the error occurs about 50% of the time.
  2. OpenSearch code pointer: https://github.com/owenhalpert/OpenSearch/blob/c557f2717ad45627cacd88e8243893dd84a56623/libs/common/src/main/java/org/opensearch/common/util/concurrent/AbstractRefCounted.java#L85
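
The wording of the error points at a reference-counting guard: once the count reaches zero the resource is closed, and a later attempt to take a new reference is rejected. The following is a simplified Python model of that pattern, offered as an illustration rather than a diagnosis. It is not a translation of the linked Java class; the class name `RefCounted`, the initial count of 1, and the release-then-acquire ordering are all assumptions.

```python
import threading


class RefCounted:
    """Simplified model of a reference-counted resource (hypothetical)."""

    def __init__(self, name):
        self._name = name
        self._count = 1  # assumption: the creator holds the first reference
        self._lock = threading.Lock()
        self.closed = False

    def inc_ref(self):
        """Take a reference, or fail if the resource was already closed."""
        with self._lock:
            if self._count <= 0:
                raise RuntimeError(
                    f"{self._name} is already closed can't increment "
                    f"refCount current count [{self._count}]"
                )
            self._count += 1

    def dec_ref(self):
        """Release a reference; close the resource when none remain."""
        with self._lock:
            if self._count <= 0:
                raise RuntimeError(f"{self._name} was already released")
            self._count -= 1
            if self._count == 0:
                self.closed = True


ref = RefCounted("IndexAllocation-Reference")
ref.dec_ref()  # e.g. the owner releases the resource first

try:
    ref.inc_ref()  # a late reader then tries to use it
    message = None
except RuntimeError as exc:
    message = str(exc)

print(message)
```

In this model the failure only needs one ordering: the last reference is released before a reader increments the count. Whether that ordering is what happens in the reported workload is not established here.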

owenhalpert added the bug and untriaged labels on Dec 18, 2024
navneet1v moved this from Backlog to 2.19.0 in Vector Search RoadMap on Dec 24, 2024
navneet1v added the duplicate label and removed the bug label on Dec 24, 2024
navneet1v changed the title from [BUG] "IndexAllocation-Reference closed" error on search workload to [Duplicate] "IndexAllocation-Reference closed" error on search workload on Dec 24, 2024

navneet1v commented Dec 24, 2024

Similar issue reported here: #2262

navneet1v commented

@Gankris96, can you please take a look?
