
Add OpenSearch performance 2.17 blog #3470

Merged: 8 commits, Nov 27, 2024
Changes from 1 commit
Editorial comments
Signed-off-by: Fanit Kolchina <kolchfa@amazon.com>
kolchfa-aws committed Nov 27, 2024
commit 833914cb5829555b2d2c65124c554913d2e9fb07
16 changes: 8 additions & 8 deletions _posts/2024-11-26-opensearch-performance-2.17.md
@@ -15,7 +15,7 @@
meta_keywords: OpenSearch performance progress 2.17, OpenSearch roadmap
meta_description: Learn more about the strategic enhancements and performance features that OpenSearch has delivered up to version 2.17.
has_science_table: true
-excerpt: Learn more about the strategic enhancements and performance features that OpenSearch has delivered through version 2.17.
+excerpt: Learn more about the strategic enhancements and performance features that OpenSearch has delivered up to version 2.17.
featured_blog_post: false
featured_image: false
---
@@ -24,9 +24,9 @@

The wide range of applications that OpenSearch supports means that no one number can summarize the improvements you'll see in your applications. That's why we're reporting on a variety of performance metrics, some mostly relevant to analytics in general and log analytics in particular, others mostly relevant to lexical search, and still others relevant to semantic search using vector embeddings and k-NN. Under the rubric of performance, we're also including improvements in resource utilization, notably RAM and disk.

-Overall, OpenSearch 2.17 delivers a 6x performance improvement over OpenSearch 1.3, with gains across essential operations such as text queries, term aggregations, range queries, date histograms, and sorting. And that's not even counting improvements to semantic vector search, which is now highly configurable in order to let you choose the ideal balance of response time, accuracy, and cost for your applications. All these improvements reflect the contributions and collaboration of a dedicated community, whose insights and efforts drive OpenSearch forward.
+Overall, OpenSearch 2.17 delivers a 6x performance improvement over OpenSearch 1.3, with gains across essential operations such as text queries, terms aggregations, range queries, date histograms, and sorting. And that's not even counting improvements to semantic vector search, which is now highly configurable in order to let you choose the ideal balance of response time, accuracy, and cost for your applications. All these improvements reflect the contributions and collaboration of a dedicated community, whose insights and efforts drive OpenSearch forward.

-This post highlights the performance improvements in OpenSearch 2.17. The first section focuses on key query operations, including text queries, term aggregations, range queries, date histograms, and sorting. These improvements were evaluated using the [OpenSearch Big5 workload](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/big5), which represents common use cases in both search and analytics applications. The benchmarks provide a repeatable framework for measuring real-world performance enhancements. The next section reports on vector search improvements. Finally, we present our roadmap for 2025, where you'll see that we're making qualitative improvements in many areas, in addition to important incremental changes. We are improving query speed by processing data in real time. We are building a query planner that uses resources more efficiently. We are speeding up intra-cluster communications. And we're adding efficient join operations to query domain-specific language (DSL), Piped Processing Language (PPL), and SQL. To follow our work in more detail, and to contribute comments or code, please participate on the [OpenSearch forum](https://forum.opensearch.org/) as well as directly in our GitHub repos.
+This post highlights the performance improvements in OpenSearch 2.17. The first section focuses on key query operations, including text queries, terms aggregations, range queries, date histograms, and sorting. These improvements were evaluated using the [OpenSearch Big5 workload](https://github.com/opensearch-project/opensearch-benchmark-workloads/tree/main/big5), which represents common use cases in both search and analytics applications. The benchmarks provide a repeatable framework for measuring real-world performance enhancements. The next section reports on vector search improvements. Finally, we present our roadmap for 2025, where you'll see that we're making qualitative improvements in many areas, in addition to important incremental changes. We are improving query speed by processing data in real time. We are building a query planner that uses resources more efficiently. We are speeding up intra-cluster communications. And we're adding efficient join operations to query domain-specific language (DSL), Piped Processing Language (PPL), and SQL. To follow our work in more detail, and to contribute comments or code, please participate on the [OpenSearch forum](https://forum.opensearch.org/) as well as directly in our GitHub repos.

<style>
.green-clr {
@@ -83,7 +83,7 @@
The following table summarizes performance improvements for the preceding query types.

| |**Query Types** |1.3.18 |2.7 |2.11 |2.12 |2.13 |2.14 |2.15 |2.16 |2.17 |

Check failure on line 86 in _posts/2024-11-26-opensearch-performance-2.17.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.TableHeadings] 'Query Types' is a table heading and should be in sentence case.

Suggested change:
-**Query Types** |1.3.18 |2.7 |2.11 |2.12 |2.13 |2.14 |2.15 |2.16 |2.17 |
+**Query types** |1.3.18 |2.7 |2.11 |2.12 |2.13 |2.14 |2.15 |2.16 |2.17 |

|--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |--- |
|**Big 5 areas mean latency, ms** |
@@ -97,9 +97,9 @@
Range Queries |26.08 |23.12 |16.91 |18.71 |17.33 |17.39 |18.51 |3.17 |3.17 |

Suggested change:
-Range Queries |26.08 |23.12 |16.91 |18.71 |17.33 |17.39 |18.51 |3.17 |3.17 |
+Range queries |26.08 |23.12 |16.91 |18.71 |17.33 |17.39 |18.51 |3.17 |3.17 |

|Date Histogram |6068 |5249 |5168 |469 |357 |146 |157 |164 |160 |

Suggested change:
-Date Histogram |6068 |5249 |5168 |469 |357 |146 |157 |164 |160 |
+Date histogram |6068 |5249 |5168 |469 |357 |146 |157 |164 |160 |

|Aggregate (geo mean) |195.96 |154.59 |130.9 |74.85 |51.84 |43.44 |37.07 |24.66 |24.63 |

Check failure on line 100 in _posts/2024-11-26-opensearch-performance-2.17.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: geo. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.
|Speedup factor, compared to OS 1.3 (geo mean) |1.0 |1.27 |1.50 |2.62 |3.78 |4.51 |5.29 |7.95 |7.96 |

Check failure on line 101 in _posts/2024-11-26-opensearch-performance-2.17.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: geo. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.
|Relative latency, compared to OS 1.3 (geo mean) |100% |78.89 |66.80 |38.20 |26.45 |22.17 |18.92 |12.58 |15.93 |

Check failure on line 102 in _posts/2024-11-26-opensearch-performance-2.17.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: geo. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

For a detailed benchmark analysis or to run your own benchmarks, see the [Appendix](#appendix---benchmarking-tests-and-results).

@@ -113,11 +113,11 @@

With OpenSearch 2.17, we further amplified these performance gains. Building on the foundation of the **match_only_text** field, OpenSearch 2.17 optimizes text queries, achieving **21% faster performance compared to 2.14** and **63% faster performance compared to 1.3**. These improvements stem from continued enhancements to query execution and index optimization. Applications relying on text search for analytics or high-recall use cases can now achieve faster results with reduced resource usage, making OpenSearch 2.17 an even more powerful choice for modern text search workloads.
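For readers unfamiliar with the field type referenced above, a minimal mapping sketch showing how a log field opts into `match_only_text` (the index and field names here are hypothetical; see the OpenSearch field type documentation for the full set of options):

```json
PUT /logs
{
  "mappings": {
    "properties": {
      "message": {
        "type": "match_only_text"
      }
    }
  }
}
```

`match_only_text` trades scoring and positional features for a much smaller index, which is why it suits high-volume log analytics workloads.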

-### Term and multi-term aggregations
+### Terms and multi-terms aggregations

-Term aggregations are crucial for slicing large datasets based on multiple criteria, making them important query operations for data analytics use cases. Building on prior advancements, OpenSearch 2.17 enhances the efficiency of global term aggregations, using term frequency optimizations to handle large immutable collections, such as log data, with unprecedented speed.
+Terms aggregations are crucial for slicing large datasets based on multiple criteria, making them important query operations for data analytics use cases. Building on prior advancements, OpenSearch 2.17 enhances the efficiency of global terms aggregations, using term frequency optimizations to handle large immutable collections, such as log data, with unprecedented speed.

-Performance benchmarks demonstrate a **61% performance improvement compared to OpenSearch 2.14** and an overall **81% reduction in query latency compared to OpenSearch 1.3**, while **multi-term aggregation queries demonstrate up to a 20% reduction in latency**. Additionally, memory efficiency is improved dramatically, with a **50--60% reduction in memory footprint for short-lived objects** because new byte array allocations for composite key storage are not needed.
+Performance benchmarks demonstrate a **61% performance improvement compared to OpenSearch 2.14** and an overall **81% reduction in query latency compared to OpenSearch 1.3**, while **multi-terms aggregation queries demonstrate up to a 20% reduction in latency**. Additionally, memory efficiency is improved dramatically, with a **50--60% reduction in memory footprint for short-lived objects** because new byte array allocations for composite key storage are not needed.
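The aggregations discussed above can be exercised with a request like the following sketch, which runs a single-field `terms` aggregation alongside a `multi_terms` aggregation over two fields (index and field names are hypothetical):

```json
GET /logs/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "terms": { "field": "status" }
    },
    "by_status_and_host": {
      "multi_terms": {
        "terms": [
          { "field": "status" },
          { "field": "host" }
        ]
      }
    }
  }
}
```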

OpenSearch 2.17 also introduced support for the **[wildcard field type](https://github.com/opensearch-project/OpenSearch/pull/13461)**, enabling highly efficient execution of wildcard, prefix, and regular expression queries. This new field type uses trigrams (or bigrams and individual characters) to match patterns before applying a post-filtering step to evaluate the original field, resulting in faster and more efficient query execution.
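As a hedged sketch of how the new field type might be used (index and field names are hypothetical), the field is declared in the mapping and then queried with an ordinary `wildcard` query; the trigram-based matching described above happens transparently:

```json
PUT /logs-wildcard
{
  "mappings": {
    "properties": {
      "path": { "type": "wildcard" }
    }
  }
}

GET /logs-wildcard/_search
{
  "query": {
    "wildcard": {
      "path": { "value": "*/error/*" }
    }
  }
}
```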

@@ -146,7 +146,7 @@

## Vector search

**Disk-optimized vector search**: The OpenSearch vector engine continues to prioritize cost savings in the 2.17 release. This release introduced disk-optimized vector search, allowing you to use the full potential of vector workloads, even in low-memory environments. Disk-optimized vector search is designed to provide out-of-the-box **32x compression** when using binary quantization, a powerful compression technique. Additionally, you have the flexibility to fine-tune costs, response time, and accuracy to your unique needs through configurable parameters such as compression rate, sampling, and rescoring. According to internal benchmarks, OpenSearch's disk-optimized vector search can deliver cost savings of up to 70% while maintaining p90 latencies of around 200 ms and recall of over 0.9. For more information, see [Disk-based vector search](https://opensearch.org/docs/latest/search-plugins/knn/disk-based-vector-search/).
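A minimal mapping sketch for opting an index into disk-optimized vector search, assuming the documented `mode: on_disk` setting (index and field names are hypothetical; dimension and space type depend on your embedding model):

```json
PUT /my-vectors
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "space_type": "innerproduct",
        "mode": "on_disk"
      }
    }
  }
}
```

Per the documentation linked above, `on_disk` mode applies 32x binary quantization by default, and parameters such as the compression level and rescoring behavior can be tuned further.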

Check failure on line 149 in _posts/2024-11-26-opensearch-performance-2.17.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: rescoring. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.

**Cost improvements by reducing memory footprint**: Vector search capabilities in native engines (Faiss and NMSLIB) received a significant boost in OpenSearch 2.17. In this version, OpenSearch's byte compression technique is extended to the Faiss engine's [HNSW](https://github.com/opensearch-project/k-NN/pull/1823) and [IVF](https://github.com/opensearch-project/k-NN/pull/2002) algorithms to further reduce memory footprint by up to 75% for vectors within byte range ([-128, 127]). These optimizations provide an additional 25% memory footprint savings compared to OpenSearch 2.14 with [FP16 quantization](https://opensearch.org/blog/optimizing-opensearch-with-fp16-quantization/) and an overall savings of up to 85% compared to OpenSearch 1.3.
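A sketch of a mapping that stores byte-range vectors with the Faiss engine, assuming the `data_type: byte` field option described in the k-NN documentation (index and field names are hypothetical):

```json
PUT /byte-vectors
{
  "settings": { "index.knn": true },
  "mappings": {
    "properties": {
      "embedding": {
        "type": "knn_vector",
        "dimension": 128,
        "data_type": "byte",
        "method": {
          "name": "hnsw",
          "engine": "faiss",
          "space_type": "l2"
        }
      }
    }
  }
}
```

Note that each vector component must fall within [-128, 127]; values outside that range are rejected at indexing time.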

@@ -166,7 +166,7 @@
* **[Native join support](https://github.com/opensearch-project/OpenSearch/issues/15185)**: We're introducing efficient join operations across indexes that will be natively supported and fully integrated with OpenSearch's query DSL, PPL, and SQL.
* **Native vectorized processing**: By using modern CPU SIMD operations and native code, we're optimizing the processing of data streams to eliminate Java's garbage collection bottlenecks.
* **[Smarter query planning](https://github.com/opensearch-project/OpenSearch/issues/12390)**: Optimizing where and how computations run will ensure reduced unnecessary data transfer and improve performance for parallel query execution.
* **[gRPC-based Search API](https://github.com/opensearch-project/OpenSearch/issues/15190)**: We're enhancing client-server communication with Protobuf and gRPC, accelerating search by reducing overhead.

Check failure on line 169 in _posts/2024-11-26-opensearch-performance-2.17.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: Protobuf. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.
* **[Query performance optimization](https://github.com/orgs/opensearch-project/projects/153)**: Improving performance remains our consistent priority, and several key initiatives, such as docID encoding and query approximation, will reduce index size and enhance the performance of large-range queries.
* **[Star-tree indexing](https://github.com/opensearch-project/OpenSearch/issues/12498)**: Precomputing aggregations using star-tree indexing will ensure faster, more predictable performance for aggregation-heavy queries.

@@ -177,7 +177,7 @@
* **Index build acceleration with GPUs and SIMD:** k-NN performance can be enhanced by using libraries with GPU support. Because vector distance calculations are compute-heavy, GPUs can speed up computations and reduce index build and search query times.
* **Autotuning k-NN indexes:** OpenSearch's vector database offers a toolkit of algorithms tailored for diverse workloads. In 2025, our goal is to enhance the out-of-the-box experience by autotuning hyperparameters and settings based on access patterns and hardware resources.
* **Cold-warm tiering:** In version 2.18, we added support for enabling vector search on remote snapshots. We will continue focusing on decoupling index read/write operations to extend vector indexes to different storage systems in order to reduce storage and compute costs.
-* **Memory footprint reduction:** We will continue to aggressively reduce the memory footprint of vector indexes. One of our goals is to support the ability to partially load HNSW indexes into native engines. This complements our disk-based-optimized search and helps further reduce the operating costs of OpenSearch clusters.
+* **Memory footprint reduction:** We will continue to aggressively reduce the memory footprint of vector indexes. One of our goals is to support the ability to partially load HNSW indexes into native engines. This complements our disk-optimized search and helps further reduce the operating costs of OpenSearch clusters.
* **Reduced disk storage with "derived source":** Currently, vector data is stored both in a doc-values-like format and in the stored `_source` field. The stored `_source` field can contribute more than 60% of the overall vector storage requirement. We plan to create a custom stored field format that will inject the vector fields into the source from the doc-values-like format. In addition to storage savings, this will have the secondary effects of improved indexing throughput, lighter shards, and even faster search.

### Neural search
@@ -187,7 +187,7 @@
Our 2025 roadmap emphasizes optimizing performance, enhancing functionality, and simplifying adoption. Key initiatives include:

- **Improving hybrid query performance**: Reduce latency by up to 25%.
- **Introducing explainability for hybrid queries**: Provide insights into how each subquery result contributes to the final hybrid query result, enabling better debugging and performance analysis.

Check failure on line 190 in _posts/2024-11-26-opensearch-performance-2.17.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: explainability. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.
- **Supporting additional algorithms for combining hybrid query results**: Support algorithms like reciprocal rank fusion (RRF), which improves hybrid search latency by avoiding costly score normalization because the scores are rank based.
- **Enhancing neural sparse pruning strategies**: Apply techniques such as pruning by weight, by ratio with max weight, by top-k, and by alpha-mass to improve performance by 20%.
- **Optimizing inference calls during updates and reindexing**: Reduce the number of inference calls required for neural and sparse ingestion pipelines, increasing throughput by 20% for these operations.
@@ -200,7 +200,7 @@

OpenSearch continues to evolve, not only by expanding functionality but also by significantly enhancing performance, efficiency, and scalability across diverse workloads. OpenSearch 2.17 exemplifies the community's commitment, delivering improvements in query speed, resource utilization, and memory efficiency across text queries, aggregations, range queries, and time-series analytics. These advancements underscore our dedication to optimizing OpenSearch for real-world use cases.

-Key innovations like disk-optimized vector search and enhancements to term and multi-term aggregations demonstrate our focus on staying at the forefront of vector search and analytics technology. Additionally, OpenSearch 2.17's improvements to hybrid and vector search, combined with roadmap plans for streaming architecture, gRPC APIs, and smarter query planning, highlight our forward-looking strategy for meeting the demands of modern workloads.
+Key innovations like disk-optimized vector search and enhancements to terms and multi-terms aggregations demonstrate our focus on staying at the forefront of vector search and analytics technology. Additionally, OpenSearch 2.17's improvements to hybrid and vector search, combined with roadmap plans for streaming architecture, gRPC APIs, and smarter query planning, highlight our forward-looking strategy for meeting the demands of modern workloads.

These achievements are made possible through collaboration with the broader OpenSearch community, whose contributions to testing, feedback, and development have been invaluable. Together, we are building a robust and efficient search and analytics engine capable of addressing current and future challenges.

@@ -236,8 +236,8 @@
|query-string-on-message-filtered |2 |67.25 |47 |30.25 |46.5 |47.5 |46 |46.75 |29.5 |30 |
|query-string-on-message-filtered-sorted-num |3 |125.25 |102 |85.5 |41 |41.25 |41 |40.75 |24 |24.5 |
|term |4 |4 |3.75 |4 |4 |4 |4 |4 |4 |4 |

Collaborator comment: "Term" (capitalized)?

|Sorting |asc_sort_timestamp |5 |9.75 |15.75 |7.5 |7 |7 |7 |7 |7 |7 |

Check failure on line 239 in _posts/2024-11-26-opensearch-performance-2.17.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: asc_sort_timestamp. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.
|asc_sort_timestamp_can_match_shortcut |6 |13.75 |7 |7 |6.75 |6 |6.25 |6.5 |6 |6.25 |

Check failure on line 240 in _posts/2024-11-26-opensearch-performance-2.17.md (GitHub Actions / style-job): [vale] reported by reviewdog 🐶 [OpenSearch.Spelling] Error: asc_sort_timestamp_can_match_shortcut. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.
|asc_sort_timestamp_no_can_match_shortcut |7 |13.5 |7 |7 |6.5 |6 |6 |6.5 |6 |6.25 |
|asc_sort_with_after_timestamp |8 |35 |33.75 |238 |212 |197.5 |213.5 |204.25 |160.5 |185.25 |
|desc_sort_timestamp |9 |12.25 |39.25 |6 |7 |5.75 |5.75 |5.75 |6 |6 |