-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Pagination in Hybrid query #933
Comments
Quite detailed RFC. Waiting for performance benchmarks with new attribute for pagination. |
These are exciting developments--also looking forward to seeing the benchmarks. |
This is an amazing PR description! |
please add more detailed description of |
Is there a documentation request/PR for this feature? |
haven't found one, feel free to close it if it's a duplicate: opensearch-project/documentation-website#8952 |
We would like to understand more about pagination_depth. How does it impact performance? Does setting pagination_depth to a very high value lead to latency issues similar to fetching all hits? or just only introduce latency from score normalization proportional to its size? |
@safakkbilici logically the number of hits fetched from each shard will be equal to pagination_depth. Consider if 3 shards are there and each shard has equal distribution of data and pagination_depth is 10. Then total 3 x 10=30 hits will be catered at coordinator node. Now if you increase the pagination_depth to insanely higher number then it will increase the latency. To justify this, for deeper pagination performance will be hit for sure. But for top pages the performance will be almost same as of normal hybrid query. I will be doing a benchmarking soon to get a rough idea till what extent of pagination the performance stays the same. I will be writing a blog on this. |
Introduction
This document talks about the design to support pagination functionality in hybrid query. OpenSearch supports multiple pagination techniques, and this document primarily focuses on two key techniques: the from and size parameters technique and the scroll operation technique.
Problem Statement
Pagination is an ability to divide a large set of search results into smaller subsets, known as pages, and retrieving them one at a time. Currently, hybrid query users cannot navigate in the result set as per their requirements.
Terminologies and key concepts
Here we define the key terms and concepts that are used throughout this document to ensure clarity and consistency in understanding the design approach later.
The from and size parameters
The from parameter is the document number from which you want to start showing the results. The size parameter is the number of results that you want to show. Together, they let you return a subset of the search results.
Scroll operation
It allows users to paginate over large set of data in a single search request. Think of it like a live cursor on a database.
If you need to request volumes of data larger than 1 PB from, for example, a machine learning job, use the scroll operation instead. The scroll operation allows you to request an unlimited number of results.
The important thing to note about the scroll operation is that scoring of the result set is not performed. It deals with large sets of data, and scoring such a large amount of data will lead to system outage.
Understanding search workflow in context of from and size parameters
The search process in OpenSearch for traditional queries executes primarily in two phases: Query Phase and Fetch Phase. The Query Phase is responsible for retrieving the query results from the shard as objects containing docId and score. The maximum number of query results collected during the Query Phase is determined by the sum of from and size parameters provided in the search request. The Fetch Phase is responsible for firstly trimming the topN query results obtained from the Query Phase where topN is equal to the value of from parameter. The source of the remaining query results are fetched from the shards. By changing the values of from and size params, users can navigate to another page of results.
Current Architecture
Hybrid query workflow
In the current hybrid query workflow, each sub-query executes the search process and returns the from + size equivalent number of results to the co-ordinator node. In the next step, normalization processor combines the sub-query result on each shard by using normalization and score-combination techniques. Later, the Fetch phase will combine all results from the shards and will return the search response. The number of hits present in the search response will be equal to the size parameter sent in the search request. In the entire hybrid query search process, the default value of from is 0.
However, when customized from and size values are sent to retrieve another page of search results, it fails to do so. This failure occurs due to change in ground truth of hybrid query results.
Why ground truth changes when from > 0
Increasing from value will lead to increase in number of results retrieved for each subquery from a shard. Instead of the new results being added to the last page, they tend to appear earlier in the list, which changes the ground truth. This can happen in two scenarios:
Requirements
Functional Requirements
Non Functional Requirements
Solution Overview
The proposal is to introduce a new parameter in the hybrid query clause, which will provide information on number of individual sub-query results to be retrieved from the shard. This new parameter will ensure that specified number of subquery results are retrieved, even when from and size values are changed. This process helps in maintaining the consistency by not adding new result entries when from > 0 is sent in the search request. Once the subquery results are retrieved, the rest of the hybrid query process remains unchanged. The from value will navigate to the requested page in the final result, and the size parameter will determine how many search results should be returned in the response.
Why this new parameter can only be added in the hybrid query clause?
Following places have been considered to add this new parameter
Sending it in the URL of the search request
It requires enhancement in SearchSourceBuilder of OpenSearch core to inculcate the changes related to the new parameter. Moreover, there is no use of this new parameter for other traditional query types as they are already supporting pagination functionality. This scope of this change is limited to hybrid query logic, therefore this option is not recommended.
Cluster settings or Index settings
For every hybrid query request, an additional call will be made to the OpenSearch core to get index or cluster settings. This will create overload and resource exhaustion in the cluster. Moreover, customers need to change the setting every time when they want to paginate on higher number of results. Additionally, settings change is captured by event listeners that triggers workflows for other important functionalities in OpenSearch. Therefore, this option is not recommended.
Sending it in the hybrid query clause [RECOMMENDED]
Adding in the hybrid query clause is recommended because the changes done by the new parameter are specific to hybrid query and do not impact other functionalities of OpenSearch. It helps customers to understand the results formation better by explicitly defining how many subquery results they want to process under hybrid query logic.
Changes in customer experience
The from and size parameters technique
From the customer’s perspective, the new parameter will act as an indicator on how deeply they want to apply hybridization of subquery results and support pagination on top of it.
The following example visualizes the addition of new parameter in the hybrid query clause. The query below retrieves up to 100 results of each subquery and applies normalization process on them. Then it returns the search response that contains 10 hits. The first hit in the response is the 5th entry after combining all the subquery results.
The new parameter will be referenced as “pagination_depth” in the rest of document for explanatory purposes. There is open question below for the appropriate name suggestion.
Note: The default value of the pagination_depth will be 0. If the customer doesn’t want to apply pagination and just wants to perform the standard hybrid search, then they do not have to provide the depth information. In that case, the hybrid query will use the current formula of from + size for retrieving the shard results.
The scroll operation
Pagination using scroll api cannot be performed in hybrid query process. The reason being hybrid query needs scoring of documents and scroll api does not support it. The main use case of scroll operation is when users need to reindex the data to a new index. This use case can be easily achieved by any other traditional query. Therefore, an illegal argument exception will be thrown when user tries to apply scroll operation with hybrid query.
High level design
(red arrows == current behavior, green arrows == new flow)
Low level design
The interfaces in the following classes will be updated to inculcate the addition of new parameter.
As there are no component changes in the current hybrid query workflow, therefore flow diagram is not added.
Testability
Backward Compatibility
The changes made to inculcate the new parameter addition are backward compatible. If pagination_depth is not provided in the search request then the current formula of from+size will be used for the subquery results retrieval.
Unit and Integration Tests
Both the test will be added in the PR.
Open questions
Q) Possible names of the new parameter to be added?
Add the suggested names in the following list.
Appendix
What is Pagination?
It is a process to divide a large set of search results into smaller subsets, known as pages, and retrieving them one at a time.
There are four different type of pagination techniques supported by OpenSearch:
To know more about pagination and its techniques, please read the OpenSearch documentation on it.
Duplicates
If a result entry is part of another subquery result set, it is considered a duplicate result.
For example, consider a scenario where a user sends a hybrid query request. The hybrid query contains two subqueries, Subquery A and Subquery B. If the result entry of Subquery A is also part of Subquery B’s results then it is considered a duplicate result.
Unique results
If a result entry is part of only one subquery, it is considered as unique result.
The text was updated successfully, but these errors were encountered: