[RFC] Common Filter Support for Hybrid Query Sub-Queries #1135

Open
bzhangam opened this issue Jan 22, 2025 · 0 comments

Introduction

This document proposes a solution to support a common filter for the sub-queries of the hybrid query in Hybrid Search.

Problem Statement

Currently the hybrid query clause offers no simple way to apply a common filter to all of its sub-queries. Users have to copy the same filter clause into every sub-query, which leads to duplicated filters and a sub-optimal user experience.

Scope

Currently, if we want to apply a filter to all the sub-queries of a hybrid query, we need to do it this way:
{
  "hybrid": {
    "queries": [
      {
        "neural": {
            "filter": <filter>,
            ...
        }
      },
      {
        "bool":{
          "must": [
            <sub-query2>
          ],
          "filter": [
            <filter>
          ]
        }
      },
      {
        "bool":{
          "must": [
            <sub-query3>
          ],
          "filter": [
            <filter>
          ]
        }
      }
    ]
  }
}

We are looking for a way to simplify the hybrid query so that we don’t need to add the duplicated common filter to each sub-query.

Out of Scope

Metrics. Currently we haven’t set up the metric API for the neural plugin, and we plan to have a separate project for that, so for this one we will not consider publishing metrics for this new feature.

Configurability. We currently plan to implement a straightforward approach for applying a common filter to all sub-queries. If a sub-query already has its own filter, the common filter will be combined using AND logic. This aligns with the intuitive expectation that adding a common filter to a hybrid query should apply it to all inner queries. To move quickly and gather feedback, we will focus on this simple default behavior. However, in the future, we may explore adding configurability to support the following options:

  • Allow each sub-query to decide when the common filter is applied: either during the sub-query or after it.
  • Allow each sub-query to decide whether to combine the common filter with its own filter using AND, OR or IGNORE_IF_EXISTS (ignore the common filter if the inner query already has its own filter) logic.

No Filter Logic Change. Our plan is solely to add the common filter to the sub-queries without altering how the sub-queries handle their filter logic. For example, the common filter can be directly applied to sub-queries such as neural or knn queries, or the sub-queries can be wrapped in a bool query with the filter applied. However, the way the neural, knn, or bool queries process their respective filter logic will remain unchanged. This ensures that the existing behavior of these queries is preserved, maintaining consistency and avoiding unintended impacts.

No Further Filter Push Down. While the sub-query in a hybrid query can itself contain inner queries, we propose limiting the filter push-down to the top-level sub-query of the hybrid query. We do not intend to further push the filter down into the inner queries of the sub-query, as this is not a common use case. Our goal is to keep the solution straightforward and focused on typical scenarios. Supporting deeper filter push-down would add unnecessary complexity without significant practical benefits, so we have opted to avoid addressing it in this implementation.

Solution

High Level Design

We propose adding the common filter to the sub-queries by rewriting them when we parse the search request into the QueryBuilder, in the fromXContent function of the HybridQueryBuilder.

When OpenSearch receives a search query, it goes through the following high-level steps (some details are omitted):

  1. Route the request to the right handler.
  2. Parse the request into the QueryBuilder. We recommend modifying this step to push down the filter.
  3. Process the request if the SearchRequestProcessor is defined in the search pipeline.
  4. Convert the QueryBuilder to Lucene Query.
  5. Execute the Query.
  6. Process the Query result.
  7. Return the response.

KNNQueryBuilder has special logic to optimize the filter before it is converted to the Lucene Query, so we recommend adding the common filter to the sub-queries before they are converted to Lucene queries. Based on this, we should do that work either in step 2, when we parse the request into the QueryBuilder, or in step 3, when we process the request through a SearchRequestProcessor.

Option 1: Push down the filter when we parse the request to the QueryBuilder.

In this option, we include the common filter to be pushed down as part of the hybrid query through the new filter field. The filter is then applied during the parsing process by adding the push-down operation to the fromXContent function of the HybridQueryBuilder, which is responsible for parsing the hybrid query.

The example query in the Scope section can be simplified as below:

{
  "hybrid": {
    "queries": [
      {
        "neural": {
          ...
        }
      },
      {
        <sub-query2>
      },
      {
        <sub-query3>
      }
    ],
    "filter": <filter>
  }
}

Why do we propose the above filter data structure?

We propose using a single common filter that will be pushed down to inner queries. This approach will be thoroughly documented to ensure clarity regarding how the common filter is applied. In the "Out of Scope" section, we mentioned the possibility of supporting configurable options for combining the common filter with an inner query’s filter in the future. At that point, we can extend the single filter by introducing a new parameter, such as a "type" field, to control this behavior. Our goal is to start with a simple design and gradually introduce additional functionality.

Option 2: Add the common filter through a SearchRequestProcessor.

In this option we need to create a new SearchRequestProcessor, similar to FilterQueryRequestProcessor, that adds a common filter to the sub-queries of a hybrid query. Using a hybrid query already requires a search pipeline that defines how normalization and combination are done. To add a common filter to all the sub-queries of a hybrid query, we can add a new request processor to that pipeline as below:

PUT /_search/pipeline/my_pipeline 
{
  "description": "Post processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean",
          "parameters": {
            "weights": [
              0.3,
              0.7
            ]
          }
        }
      }
    }
  ],
  "request_processors": [
    {
      "hybrid_query_common_sub_query_filter" : {
        "description" : "This processor is going to add a common filter to the sub-queries of the hybrid query.",
        "filter" : <filter>
      }
    }
  ]
}

Comparison of the options:

Option 1 (Recommended)
  Pros:
    1. Less effort. This option only needs to modify an existing class to add the filter push-down logic, whereas option 2 needs a new SearchRequestProcessor plus similar logic to push the filter into the sub-queries. So option 1 is easier to implement.
    2. Easier to use. It is easier to pass the filter to push down as part of the hybrid query than to define a new SearchRequestProcessor in the search pipeline.
  Cons: none listed.

Option 2
  Pros:
    1. Reuse the filter. Once the filter is defined in the search pipeline we can easily reuse it, especially when the search pipeline is set at the index level. There is no need to add the filter to every query.
  Cons:
    1. More effort.
    2. More difficult to use.

In summary, we recommend option 1 because it is easier to implement and easier to use. Option 2 can be a feature to support in the future. It is similar to the FilterQueryRequestProcessor, whose effect can also be achieved by adding the filter to the query directly; supporting it through a SearchRequestProcessor makes it a default filter for the index if we set the search pipeline as the index's default search pipeline.

Low Level Design

The low-level design covers the recommended option from the High Level Design, which is to push down the filter when we parse the request into the QueryBuilder.

Option 1: Copy Hybrid Query Filter to Sub-Queries if Applicable (Recommended)

The sub-queries of the hybrid query can be of any query type, and most of them do not support a filter field. For those query types we can use a boolean query to combine the query and the filter, as we do today. But there are also query types that do support the filter field (NeuralQueryBuilder, KNNQueryBuilder, HybridQueryBuilder, BoolQueryBuilder and ConstantScoreQueryBuilder).

The different query types are handled as follows:

NeuralQueryBuilder/KNNQueryBuilder
Currently NeuralQuery wraps the KNNQuery, and the KNNQuery supports the filter field. The filter of a NeuralQuery/KNNQuery is applied during the query, before the top k results are returned, so it does not drop relevant tail results the way wrapping a NeuralQuery/KNNQuery in a boolean query with a filter can. Besides, hybrid query is heavily used with NeuralQuery/KNNQuery, and our customers expect the filter to be added to the NeuralQuery/KNNQuery directly.
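
As a rough illustration of the AND behavior for a query type with a native filter field, the sketch below folds the common filter into the query's own filter instead of wrapping the query. The types KnnFilterPushDownSketch, Query, Term, AllOf and KnnLike, and the method withCommonFilter, are hypothetical stand-ins invented for this example; they are not the k-NN or neural-search plugin classes, and the real builders may combine filters differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the k-NN or neural-search plugin classes.
final class KnnFilterPushDownSketch {

    /** Minimal query abstraction; describe() renders a compact debug form. */
    interface Query {
        String describe();
    }

    /** A leaf filter such as a term query. */
    static final class Term implements Query {
        private final String field;
        private final String value;

        Term(String field, String value) {
            this.field = field;
            this.value = value;
        }

        @Override
        public String describe() {
            return field + ":" + value;
        }
    }

    /** Conjunction of filters, standing in for a bool query that only has filter clauses. */
    static final class AllOf implements Query {
        private final List<Query> clauses;

        AllOf(List<Query> clauses) {
            this.clauses = Collections.unmodifiableList(new ArrayList<>(clauses));
        }

        @Override
        public String describe() {
            List<String> parts = new ArrayList<>();
            for (Query clause : clauses) {
                parts.add(clause.describe());
            }
            return "allOf" + parts;
        }
    }

    /** Vector query with an optional native filter that is applied during the top-k search. */
    static final class KnnLike implements Query {
        private final String field;
        private final int k;
        private final Query filter; // null when the query has no filter of its own

        KnnLike(String field, int k, Query filter) {
            this.field = field;
            this.k = k;
            this.filter = filter;
        }

        /** Returns a copy whose native filter is (existing filter AND common filter). */
        KnnLike withCommonFilter(Query common) {
            if (common == null) {
                return this;
            }
            if (filter == null) {
                return new KnnLike(field, k, common);
            }
            List<Query> both = new ArrayList<>();
            both.add(filter);
            both.add(common);
            return new KnnLike(field, k, new AllOf(both));
        }

        @Override
        public String describe() {
            return "knn(" + field + ", k=" + k + ", filter=" + (filter == null ? "none" : filter.describe()) + ")";
        }
    }
}
```

The sketch returns a new object rather than mutating the original, so the sub-query as written by the user stays untouched; whether the real builders should copy or mutate is an open implementation choice.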

HybridQueryBuilder
The hybrid query should be the top-level query, so it should not be wrapped in another hybrid query. This means that, for now, no filter can be pushed down into a hybrid query; if that happens we should throw an error saying this is not supported behavior. There is currently an open issue about validating queries where a hybrid query is nested in another hybrid query, but our implementation should still handle this invalid case.
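
A minimal sketch of such a guard is shown below. NestedHybridGuardSketch, Query, Hybrid and rejectNestedHybrid are hypothetical names used only for this example, and the exception type and message are assumptions; the RFC does not specify them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the OpenSearch QueryBuilder classes.
final class NestedHybridGuardSketch {

    /** Any sub-query; only its name matters for this sketch. */
    static class Query {
        final String name;

        Query(String name) {
            this.name = name;
        }
    }

    /** A hybrid query holding its top-level sub-queries. */
    static final class Hybrid extends Query {
        final List<Query> subQueries;

        Hybrid(List<Query> subQueries) {
            super("hybrid");
            this.subQueries = new ArrayList<>(subQueries);
        }
    }

    /** Fails fast if any top-level sub-query is itself a hybrid query. */
    static void rejectNestedHybrid(Hybrid outer) {
        for (int i = 0; i < outer.subQueries.size(); i++) {
            if (outer.subQueries.get(i) instanceof Hybrid) {
                // Assumed exception type and wording; the real error is up to the implementation.
                throw new IllegalArgumentException(
                    "sub-query [" + i + "] is a hybrid query; a common filter cannot be pushed down into a nested hybrid query");
            }
        }
    }
}
```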

BoolQueryBuilder
Technically, a bool query does not support a filter field; it has a field called filterClauses, which is a list of filters. The filter clauses of a bool query are applied after its inner queries, which works somewhat like a post-filter, so we should get the same results whether we add the hybrid query filter to filterClauses directly or wrap the bool query in another bool query. Adding the hybrid query filter directly to filterClauses can be more efficient, though, since we save a bool query.

For example, a hybrid query with a bool sub-query can look like this:

{
  "hybrid": {
    "queries": [
      {
        "bool": {
          "must": [
            <bool-sub-query-1>
          ],
          "filter": [
            <bool-filter-1>
          ]
        }
      },
      {
        <sub-query-1>
      }
    ],
   "filter": <filter>
  }
}

The above query should be converted to the below query when we parse the request to the QueryBuilder:

{
  "hybrid": {
    "queries": [
      {
        "bool": {
          "must": [
            <bool-sub-query-1>
          ],
          "filter": [
            <bool-filter-1>,
            <filter>
          ]
        }
      },
      {
        "bool": {
          "must": [
            <sub-query-1>
          ],
          "filter": [
            <filter>
          ]
        }
      }
    ]
  }
}
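
The rewrite above can be sketched in a few lines of Java. The model below is a simplified assumption: BoolFilterPushDownSketch, Query, Bool and withCommonFilter are hypothetical names for this example, not the OpenSearch classes. The base type wraps itself in a new bool query, while the bool type appends the common filter to a copy of its own filter clauses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; not the OpenSearch QueryBuilder classes.
final class BoolFilterPushDownSketch {

    /** Base type: by default a query is wrapped in a new bool query that carries the filter. */
    static class Query {
        final String name;

        Query(String name) {
            this.name = name;
        }

        Query withCommonFilter(Query common) {
            if (common == null) {
                return this;
            }
            Bool wrapper = new Bool();
            wrapper.must.add(this);
            wrapper.filter.add(common);
            return wrapper;
        }

        String describe() {
            return name;
        }
    }

    /** Bool query: appends the common filter to its filter clauses instead of wrapping itself. */
    static final class Bool extends Query {
        final List<Query> must = new ArrayList<>();
        final List<Query> filter = new ArrayList<>();

        Bool() {
            super("bool");
        }

        @Override
        Query withCommonFilter(Query common) {
            if (common == null) {
                return this;
            }
            // Work on a copy so the original sub-query is left unchanged.
            Bool copy = new Bool();
            copy.must.addAll(must);
            copy.filter.addAll(filter);
            copy.filter.add(common);
            return copy;
        }

        @Override
        String describe() {
            return "bool(must=" + names(must) + ", filter=" + names(filter) + ")";
        }

        private static String names(List<Query> queries) {
            List<String> out = new ArrayList<>();
            for (Query q : queries) {
                out.add(q.describe());
            }
            return out.toString();
        }
    }
}
```

Only the top-level sub-query is rewritten here; the clauses inside must are not visited, which matches the "No Further Filter Push Down" limit stated earlier.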

ConstantScoreQueryBuilder
A constant_score query wraps a filter query and assigns all documents in the results a relevance score equal to the value of the boost parameter. The filter query can be of any query type, so to combine the hybrid query filter with the constant_score query's filter we will call the filter function of that inner filter query.
e.g.

{
  "hybrid": {
    "queries": [
      {
        "constant_score": {
          "filter": {
            "neural": {
              ...
            }
          },
          "boost": 1.2
        }
      },
      {
        <sub-query-1>
      }
    ],
    "filter": <filter>
  }
}

The above query should be converted to the below query when we parse the request to the QueryBuilder:

{
  "hybrid": {
    "queries": [
      {
        "constant_score": {
          "filter": {
            "neural": {
              "filter": <filter>,
              ...
            }
          },
          "boost": 1.2
        }
      },
      {
        "bool": {
          "must": [
            <sub-query-1>
          ],
          "filter": [
            <filter>
          ]
        }
      }
    ]
  }
}
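
The delegation described for constant_score can be sketched as follows. All names here (ConstantScoreFilterPushDownSketch, Query, Leaf, Filtered, ConstantScore, withCommonFilter) are hypothetical stand-ins, not the OpenSearch classes. In this simplified model the inner query is a plain leaf that gets wrapped; a neural or knn inner query would instead take the filter into its own filter field, as in the JSON above.

```java
// Hypothetical stand-ins for illustration only; not the OpenSearch QueryBuilder classes.
final class ConstantScoreFilterPushDownSketch {

    interface Query {
        /** Returns a query equivalent to this one, restricted by the common filter. */
        Query withCommonFilter(Query common);

        String describe();
    }

    /** A plain query without a native filter field; it is wrapped as must + filter. */
    static final class Leaf implements Query {
        private final String name;

        Leaf(String name) {
            this.name = name;
        }

        @Override
        public Query withCommonFilter(Query common) {
            return common == null ? this : new Filtered(this, common);
        }

        @Override
        public String describe() {
            return name;
        }
    }

    /** Stands in for a bool query with one must clause and one filter clause. */
    static final class Filtered implements Query {
        private final Query must;
        private final Query filter;

        Filtered(Query must, Query filter) {
            this.must = must;
            this.filter = filter;
        }

        @Override
        public Query withCommonFilter(Query common) {
            return common == null ? this : new Filtered(this, common);
        }

        @Override
        public String describe() {
            return "bool(must=" + must.describe() + ", filter=" + filter.describe() + ")";
        }
    }

    /** constant_score: keeps its boost and lets the inner filter query absorb the common filter. */
    static final class ConstantScore implements Query {
        private final Query inner;
        private final double boost;

        ConstantScore(Query inner, double boost) {
            this.inner = inner;
            this.boost = boost;
        }

        @Override
        public Query withCommonFilter(Query common) {
            return common == null ? this : new ConstantScore(inner.withCommonFilter(common), boost);
        }

        @Override
        public String describe() {
            return "constant_score(" + inner.describe() + ", boost=" + boost + ")";
        }
    }
}
```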

Implementation Details

In HybridQueryBuilder we have fromXContent function which is used to parse a hybrid query and we can refactor it like below:

// First parse the filter to a filter query builder
else if (token == XContentParser.Token.START_OBJECT) {
    ...
    if (FILTER_FIELD.match(currentFieldName, parser.getDeprecationHandler())) {
        filter = parseInnerQueryBuilder(parser);
    }
    ...
}

// Combine the hybrid query filter with each sub-query and use the results as the
// modified sub-queries of the HybridQuery
HybridQueryBuilder compoundQueryBuilder = new HybridQueryBuilder();
compoundQueryBuilder.queryName(queryName);
compoundQueryBuilder.boost(boost);
for (QueryBuilder subQuery : queries) {
    if (filter == null) {
        compoundQueryBuilder.add(subQuery);
    } else {
        compoundQueryBuilder.add(subQuery.filter(filter, FilterCombinationMode.AND));
    }
}
return compoundQueryBuilder;

Here we recommend adding a new function to all QueryBuilders that combines the query with the hybrid query filter. Currently QueryBuilder is the interface, which is implemented by AbstractQueryBuilder, and all query types extend AbstractQueryBuilder, so we can build the common behavior there. This way we don’t need to modify the QueryBuilder of query types that don’t need special filter push-down handling.

This new function takes the filter and a filterCombinationMode and returns a QueryBuilder. Even though for now we only plan to support the AND mode, we want to define the interface in an extensible way so that more modes can easily be added later.

public QueryBuilder filter(QueryBuilder filter, FilterCombinationMode filterCombinationMode);

public enum FilterCombinationMode {
    // We only plan to support AND mode as the default behavior
    AND, // Combine the new filter with the existing one using AND logic.

    // Below are modes we could potentially support in the future
    IGNORE_IF_EXISTS, // Ignore the new filter if the query already has an existing filter.
    OR // Combine the new filter with the existing one using OR logic.
}

Why use AND as the default behavior?

This aligns with the intuitive expectation that adding a common filter to a hybrid query should apply it to all inner queries. Besides, if we want to combine the filter in some other way, we can add the filter to the inner queries directly without using this common filter feature.

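// In AbstractQueryBuilder we will implement the method this way
public QueryBuilder filter(QueryBuilder filter, FilterCombinationMode filterCombinationMode) {
    if (filter == null) {
        return this;
    }
    // A null mode falls back to the default AND behavior.
    if (filterCombinationMode == null || FilterCombinationMode.AND.equals(filterCombinationMode)) {
        final BoolQueryBuilder modifiedQB = new BoolQueryBuilder();
        modifiedQB.must(this);
        modifiedQB.filter(filter);
        return modifiedQB;
    }
    // Assumption: modes other than AND are not supported yet, so fail explicitly.
    throw new UnsupportedOperationException("Unsupported filter combination mode: " + filterCombinationMode);
}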

For NeuralQueryBuilder, KNNQueryBuilder, HybridQueryBuilder, BoolQueryBuilder and ConstantScoreQueryBuilder we can override the filter function so that each implements its own behavior.

Although NeuralQueryBuilder and KNNQueryBuilder share the same behavior, they belong to two different plugins. Since the filter push-down logic is pretty simple, we recommend implementing it in each QueryBuilder separately, even though this causes some code duplication.

Class Hierarchy

[Image: class hierarchy diagram]

Option 2: Only Use Bool Query to Combine the Hybrid Query Filter and Sub-Queries

Compared to option 1, option 2 proposes simply using a bool query to combine the hybrid query filter with the sub-queries for all query types, so we only need to refactor the fromXContent function of the HybridQueryBuilder.

Option 3: Post filtering

OpenSearch already supports post_filter, which applies a filter after the query. But it is applied after the hybrid query normalization and combination. Besides, it does not work well with knn and neural queries, since we need to push the filter into them to avoid dropping relevant tail results.

Summary

Option 1 (recommended)
  Pros:
    1. Better performance. Copying the hybrid query filter into the sub-queries that support the filter field can perform better. For example, if the sub-query is a bool query, we don't need another bool query to wrap it when we simply add the filter to its filter clauses.
    2. Better accuracy. For the NeuralQuery and KNNQuery, pushing the filter down into the query avoids dropping relevant tail results, which is normally what customers want.
  Cons:
    1. Implementation complexity. Compared to option 2, this option is more complicated and takes more effort to implement.

Option 2
  Pros:
    1. Easy implementation.
  Cons:
    1. Worse accuracy. Using a bool query to combine the hybrid query filter with the sub-queries can be less accurate than option 1, especially for the NeuralQuery and KNNQuery, where we can potentially drop relevant tail results.

Option 3
  Pros:
    1. No effort needed. Already supported.
  Cons:
    1. Worse accuracy. For neural and knn queries it may drop relevant tail results, which is not the behavior customers normally want.

We recommend option 1 since it is more aligned with what customers want, and the level of effort (LOE) is acceptable even though it is more complicated than options 2 and 3.

Discussion Points

Don’t Support Filter on Score as Common Inner Query Filter

The hybrid query includes a normalization process that adjusts the scores of results after the sub-queries execute. If customers use a score-based filter as the common inner query filter, they might reasonably expect the filter to apply after normalization. However, the behavior we plan to implement involves pushing the filter down to each sub-query, where it will apply before normalization. This discrepancy could lead to results that differ significantly from customer expectations. Even with thorough documentation, it is likely that customers could raise concerns or issues about this behavior. To mitigate this, we propose blocking the use of score-based filters as common inner query filters. This restriction can be enforced by adding validation logic to the fromXContent function during request parsing.
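
One possible shape for that validation is sketched below, operating on the filter as a parsed JSON-like tree of maps and lists. This is only an assumption-laden sketch: the RFC does not enumerate which query types count as score-based, so the deny-list here (function_score and script_score) is a placeholder, and ScoreBasedFilterGuardSketch and validate are hypothetical names. The check matches on keys only, so a document field that happens to share one of these names would also be rejected.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical validation sketch; not part of the OpenSearch or neural-search code base.
final class ScoreBasedFilterGuardSketch {

    // Placeholder deny-list: the RFC does not say which query types are score-based.
    static final Set<String> SCORE_BASED =
        Collections.unmodifiableSet(new HashSet<>(Arrays.asList("function_score", "script_score")));

    /** Walks a parsed filter (nested maps and lists) and rejects deny-listed query names. */
    static void validate(Object node) {
        if (node instanceof Map) {
            for (Map.Entry<?, ?> entry : ((Map<?, ?>) node).entrySet()) {
                if (SCORE_BASED.contains(entry.getKey())) {
                    throw new IllegalArgumentException(
                        "score-based query [" + entry.getKey() + "] is not supported as a common filter");
                }
                validate(entry.getValue());
            }
        } else if (node instanceof List) {
            for (Object child : (List<?>) node) {
                validate(child);
            }
        }
        // Scalars (strings, numbers, booleans, null) need no check.
    }
}
```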

Performance Analysis

There should be almost no performance delta.

There should be negligible performance impact since our solution only rewrites sub-queries during the request parsing phase, leaving the remaining search process unchanged. The rewrite simply transforms the simplified hybrid query into a format already supported and commonly used by customers. As a result, the performance of the simplified hybrid query should be nearly identical to that of the existing, more complex query.

Testability

  • Integration tests that cover the scenarios mentioned below
    • Hybrid query with a filter and multiple sub-queries should work properly.
    • The sub-queries should contain the query types using common filter push down logic.
    • The sub-queries should contain the query types using special filter push down logic including bool query, knn query, neural query and constant score query.
  • BWC tests