[RFC] Derived Source for Vectors #2377

Open
jmazanec15 opened this issue Jan 9, 2025 · 8 comments
Labels
RFC Request for comments v2.19.0

Comments

@jmazanec15
Member

Introduction

This RFC proposes removing knn_vector fields from the "_source" field without losing the OpenSearch functionality that "_source" enables. "_source" in this context refers to the per-document field in OpenSearch that stores the original source provided by the user as a StoredField in Lucene. See SourceFieldMapper for more details.

This is a follow-up to #1571 and #1572.

Problem

Currently, vectors for native indices are stored in 3 places by default:

  1. _source stored field: vectors are stored along with the rest of the JSON body of the document (i.e. .fdt)
  2. Native library files: the ANN structure and vectors are stored (i.e. .hnsw)
  3. FlatVectorsFormat files: basically doc values for vectors (i.e. .vec)

In an experiment with 10k 128-dimensional vectors, the size breakdown of these files was:

File               Size
Total index size   24.3 MB
HNSW files         5.91 MB
Doc values         3.8 MB
Source             14.6 MB

With BEST_COMPRESSION codec:

File               Size
Total index size   18.3 MB
HNSW files         5.91 MB
Doc values         3.75 MB
Source             8.64 MB

From this, we can see that 47%-60% of the storage goes towards _source. For more details, see: opensearch-project/OpenSearch#6356. Worse yet, for our disk-based feature with quantized vectors (vectors that get compressed in a lossy fashion), the native library files become smaller than the FlatVectorsFormat file, so _source takes up an even larger percentage of the storage.
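As a quick check of the 47%-60% figure, the share of _source can be recomputed from the per-file sizes listed above. This is only arithmetic on the reported numbers; the dictionary keys and the function name are made up for illustration.

```python
# Sizes in MB, copied from the two experiments above.
default_codec = {"hnsw": 5.91, "doc_values": 3.8, "source": 14.6}
best_compression = {"hnsw": 5.91, "doc_values": 3.75, "source": 8.64}


def source_share(sizes):
    """Return the fraction of the summed file sizes taken by _source."""
    return sizes["source"] / sum(sizes.values())


for label, sizes in [("default", default_codec), ("BEST_COMPRESSION", best_compression)]:
    print(f"{label}: _source is {source_share(sizes):.1%} of the index")
```

The two results bracket the range quoted above: the upper end comes from the default codec and the lower end from BEST_COMPRESSION.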

A typical user should not need to get the source vector from OpenSearch. Thus, storing the vectors in _source poses significant problems for users with minimal benefits:

  1. Users have to pay to store data they do not really need or use. This issue gets even more pronounced for disk-based vector search, where memory is no longer the bottleneck. Users end up having to provision their clusters based on storage capacity.
  2. Vectors in _source eat up serialization/deserialization bandwidth. Whenever the _source field needs to be serialized or deserialized (e.g. when written to disk, during shard migration, snapshots, etc.), a major portion of the bandwidth of this channel is consumed by the vectors in the _source themselves. This can affect many areas of a user's vector search workload, such as indexing throughput, search speed, page cache utilization, shard migration, etc. Again, this gets worse with disk-based vector search, where all resources are much more scarce.

Because of this, we generally recommend that users disable storing the vectors in the source. However, this has serious limitations:

  1. They will not be able to reindex the data
  2. The update and update-by-query APIs do not work
  3. It requires understanding a lot of concepts, which leads to a poor out-of-the-box (OOB) experience

So, enter “derived_source”. We take inspiration from “derived fields” feature of OpenSearch to use one format of data for another purpose on the fly. The idea is that we already have the vectors available via the FlatVectorsFormat files (.vec). When we need to read the _source, we should just inject the vector fields into the _source field from the FlatVectorsFormat file. The effect will be that all functionality of OpenSearch works and we get a potential > 50% reduction in storage space for vectors.

Proposed Solutions

[Option # 1] (Preferred) Implement Custom StoredFieldsFormat in existing KNNCodec

Because the KNN plugin already implements its own Codec, we can override the StoredFieldsFormat to intercept and inject the vector fields when needed. This format would use the delegate pattern (as the k-NN plugin already does with core codecs) and only intervene with respect to accesses on the _source stored field on read and write (see PoC).
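To make the delegate idea concrete, here is a minimal sketch in Python rather than the actual Lucene/Java API. All class and variable names are hypothetical stand-ins: `PlainStoredFields` plays the role of the delegate format, and the `flat_vectors` dict stands in for the vectors that are already available through the FlatVectorsFormat files. The wrapper only touches the `_source` field: it strips vector fields on write and injects them back on read.

```python
import json


class PlainStoredFields:
    """Stand-in for the delegate stored-fields format (hypothetical)."""

    def __init__(self):
        self._docs = {}

    def write(self, doc_id, fields):
        self._docs[doc_id] = dict(fields)

    def read(self, doc_id):
        return dict(self._docs[doc_id])


class DerivedSourceStoredFields:
    """Wraps a delegate and only intervenes on the _source field."""

    def __init__(self, delegate, flat_vectors, vector_fields):
        self._delegate = delegate
        self._flat_vectors = flat_vectors  # assumed stand-in for the .vec data
        self._vector_fields = set(vector_fields)

    def write(self, doc_id, fields):
        fields = dict(fields)
        if "_source" in fields:
            source = json.loads(fields["_source"])
            for name in self._vector_fields:
                source.pop(name, None)  # do not store the vector a second time
            fields["_source"] = json.dumps(source)
        self._delegate.write(doc_id, fields)

    def read(self, doc_id):
        fields = self._delegate.read(doc_id)
        if "_source" in fields:
            source = json.loads(fields["_source"])
            for name in self._vector_fields:
                vector = self._flat_vectors.get((doc_id, name))
                if vector is not None:
                    source[name] = vector  # inject from the flat vector data
            fields["_source"] = json.dumps(source)
        return fields


flat_vectors = {(0, "my_vector1"): [1.5, 2.5]}
delegate = PlainStoredFields()
fmt = DerivedSourceStoredFields(delegate, flat_vectors, ["my_vector1"])
fmt.write(0, {"_source": json.dumps({"title": "doc", "my_vector1": [1.5, 2.5]})})

stored = json.loads(delegate.read(0)["_source"])  # what actually hits disk
derived = json.loads(fmt.read(0)["_source"])      # what callers see
```

In this sketch the delegate never sees the vector, while readers going through the wrapper get the full document back, which is the property that lets reindex and update keep working.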

Pros

  1. Great out-of-the-box experience! Users would not need to provide any special configuration to get this benefit. On search, they would still need to manually exclude the vector fields, but this is consistent with the existing OpenSearch behavior.
  2. Robust feature support. Because we are modifying the _source at a very low level, we can be confident that features built on top of _source would work without any issues. The _source injection would be totally transparent.

Cons

  1. Unable to access OpenSearch resources: to implement this option, we would extend our existing codec. The codec abstraction is at the Lucene level, which makes it difficult to get some of the OpenSearch dependencies we would need. For instance, for nested fields, in order to get the parent/child filters, we would need to either directly use the FieldsFormat/PostingFormat (as was done in the PoC) or somehow create a searcher. It is unclear exactly what limitations we will hit here.
  2. Coupling of different Format readers feels like an anti-pattern. Having the StoredFieldsReader rely on KNNVectorsReader creates a dependency chain between the two. This opens the door to a circular dependency in the future (although no concrete situations come to mind).

For this option, we created a PoC to showcase feasibility. The PoC was able to support the following features:

  1. [Flat vector mappings] Injecting vectors into source
  2. [Flat vector mappings] Reindexing
  3. [Flat vector mappings] Update by query
  4. [Nested] Injecting vectors into source for single nested mapping without deletes
  5. [Nested] Reindexing
  6. [Nested] Update by query

[Option # 2] Introduce a dedicated FetchSubPhase to inject vector into source

As an alternative, as was done in #1572 by @luyuncheng, we can also create a custom FetchSubPhase to prepare the payload with the injected source that we return to the caller. Generally, this is where _source gets read (but this is not guaranteed).

The general workflow for users would be:

  1. Create an index with the vector fields explicitly excluded from source
  2. On search/get, the DerivedVectorSearchFetchSubPhase would intercept the SearchResponse (without the vectors) and add the excluded vector fields back into SearchResponse

This approach has the following pros/cons:

Pros

  1. Easy access to required OpenSearch resources: _source is an OpenSearch concept, and Lucene just sees it as a stored field. Thus, most of the configuration details around it are stored in the OpenSearch layer (as opposed to Lucene), e.g. MappedFieldTypes. Implementing at the FetchSubPhase level gives us access to these required resources. This also makes it easier to handle other OpenSearch-specific cases (such as nested fields).

Cons

  1. A FetchSubPhase from a plugin would execute after all core FetchSubPhases. Thus, the core FetchSubPhases would not have access to the vector source. There are not any explicit use cases I can think of where they need it, but if a user comes up with one, this would be a hard limitation.
  2. Non-deterministic ordering of plugin-based FetchSubPhases: OpenSearch executes FetchSubPhases sequentially and controls the ordering of the FetchSubPhases that plugins add. Thus, if another plugin adds a FetchSubPhase, it is not clear whether the vector source will be present for it to use.
  3. The overall experience is inconsistent with existing OpenSearch experience. A user would need to exclude the vector fields from source, but still get them in the search response.
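
The ordering limitation of Option 2 can be illustrated with a small Python sketch. This is not the OpenSearch FetchSubPhase API: the phase functions, the `run_fetch_phases` helper and the `flat_vectors` dict are all hypothetical. It only shows that a phase running before the injecting phase never sees the vector, while the final response does.

```python
def run_fetch_phases(hit, phases):
    """Run each phase in order; every phase works on the same hit."""
    for phase in phases:
        phase(hit)
    return hit


seen_by_core = []


def core_phase(hit):
    # Stand-in for a core sub-phase: it runs before plugin sub-phases,
    # so the vector has not been injected yet when it looks at _source.
    seen_by_core.append(sorted(hit["_source"]))


def make_derived_vector_phase(flat_vectors, vector_fields):
    """Build a phase that adds the excluded vector fields back into the hit."""

    def phase(hit):
        for name in vector_fields:
            vector = flat_vectors.get((hit["_id"], name))
            if vector is not None:
                hit["_source"][name] = vector

    return phase


flat_vectors = {("1", "my_vector1"): [0.5, 1.0]}
hit = {"_id": "1", "_source": {"title": "doc"}}
phases = [core_phase, make_derived_vector_phase(flat_vectors, ["my_vector1"])]
result = run_fetch_phases(hit, phases)
```

Here `seen_by_core` only records `title`, even though the hit returned to the caller contains `my_vector1`.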

[Option # 3] Implement Custom StoredFieldVisitor

The security plugin has a feature called “Field-level security” where admins can limit access to different users at the field level. This feature requires that they automatically filter or mask privileged fields from _source. This is similar to what we want to do for vectors! They do this by implementing a custom StoredFieldsVisitor, FlsStoredFieldsVisitor. The StoredFieldsVisitor will be called in the StoredFieldsReader, for a given document and a given field. Thus, their visitor has the option to intercept the “_source” field, and filter/mask the fields they want. They use the “onIndexModule” extension point in order to inject this via a custom readerWrapper.

We could do something similar for vector derived source, where instead of filtering and masking, we inject the vector fields.
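
A rough Python sketch of the visitor idea follows. It does not use the Lucene StoredFieldVisitor API or the security plugin's code; `CollectingVisitor`, `VectorInjectingVisitor` and `visit_document` are hypothetical names. The wrapping visitor forwards every field unchanged except `_source`, where it adds the vector fields before passing the value on.

```python
import json


class CollectingVisitor:
    """Stand-in for the visitor the caller supplies (hypothetical)."""

    def __init__(self):
        self.fields = {}

    def binary_field(self, name, value):
        self.fields[name] = value


class VectorInjectingVisitor:
    """Intercepts _source and injects vectors before delegating."""

    def __init__(self, delegate, doc_vectors):
        self._delegate = delegate
        self._doc_vectors = doc_vectors  # field name -> vector for this doc

    def binary_field(self, name, value):
        if name == "_source":
            source = json.loads(value)
            source.update(self._doc_vectors)
            value = json.dumps(source).encode("utf-8")
        self._delegate.binary_field(name, value)


def visit_document(stored_fields, visitor):
    """Stand-in for the reader calling the visitor once per stored field."""
    for name, value in stored_fields:
        visitor.binary_field(name, value)


stored_fields = [("_source", b'{"title": "doc"}'), ("_id", b"1")]
inner = CollectingVisitor()
visit_document(stored_fields, VectorInjectingVisitor(inner, {"my_vector1": [0.5, 1.0]}))
seen_source = json.loads(inner.fields["_source"])
```

A filtering visitor of the kind described above would have the same shape, removing or masking keys instead of adding them.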

Pros

  1. Somewhat easy access to required OpenSearch resources — we have everything on OpenSearch side because extension point is onIndexModule
  2. Closer than Option #1 to actual _source field retrieval, which will mean that more features will be supported out of the box

Cons

  1. Incompatible with security plugin — indexModule.setReaderWrapper can only be called once. Thus, as it stands now, security and knn derived source would not work together.
  2. Inconsistent user experience — A user will still need to exclude the vector fields from source, but still get them in the search response.

Summary

We are proposing option 1 because it provides a consistent UX with existing OpenSearch UX and extends a low level enough point to be generally robust.

Proposed User Experience

The user interface will have a setting that indicates whether or not to use the derived_source feature.

// knn.derived_source.enabled accepts true or false and defaults to true
PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "knn.derived_source.enabled": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}

// On search, my_vector1 is excluded
POST some_index/_search
{
  "_source": {
    "excludes": ["my_vector1"]
  },
  ...
}

Open Questions

Avoid reconstruction of vectors on searches that later filter it out

In the current PoC, if someone excludes a field as in the request below, the StoredFieldsReader will still inject the vector into the document, and it will later be filtered out by OpenSearch logic. Instead, we need to figure out a way to skip reconstruction in the first place if the field is going to be excluded anyway. This is a bit tricky to do and may involve a change in core. One idea is to pass this information in the FieldsVisitor and do some kind of type casting to get the information in the StoredFieldsReader component.

// On search, my_vector1 is excluded
POST some_index/_search
{
  "_source": {
    "excludes": ["my_vector1"]
  },
  ...
}
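
One way to picture the desired behavior is the Python sketch below. It assumes the exclude patterns are somehow available at the point where the source is derived, which is exactly the open part of this question. All function names are hypothetical, and `fnmatchcase` is used only as an approximation of OpenSearch wildcard matching on field names.

```python
from fnmatch import fnmatchcase


def fields_to_reconstruct(vector_fields, excludes):
    """Return the vector fields that no exclude pattern filters out."""
    return [
        field
        for field in vector_fields
        if not any(fnmatchcase(field, pattern) for pattern in excludes)
    ]


def derive_source(doc_id, stored_source, vector_fields, excludes, load_vector):
    """Inject only the vectors that will survive source filtering."""
    source = dict(stored_source)
    for field in fields_to_reconstruct(vector_fields, excludes):
        source[field] = load_vector(doc_id, field)
    return source


vector_reads = []


def load_vector(doc_id, field):
    # Stand-in for the expensive read from the flat vector files.
    vector_reads.append((doc_id, field))
    return [0.5, 1.0]


excluded = derive_source(0, {"title": "doc"}, ["my_vector1"], ["my_vector1"], load_vector)
reads_when_excluded = len(vector_reads)
included = derive_source(0, {"title": "doc"}, ["my_vector1"], [], load_vector)
```

With the exclusion in place, `load_vector` is never called, so the vector is not reconstructed only to be thrown away.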

Next Steps

  1. Publish high level design
  2. Create PoC/Proposal on core on solving redundant reconstruction of vector issue
  3. Publish low level design
@navneet1v
Collaborator

From this, we can see that 47%-60% of the storage is going towards the _source storage. For more details, see: opensearch-project/OpenSearch#6356. Worse yet, for our disk based feature with our quantized vectors (vectors that get compressed in a lossy fashion), the native lib files will get smaller than the FlatVectorsFormat file, so the _source will take up an even larger percentage of the storage.

I think when a model gives fp32 vectors, this percentage will be higher. With 128D, the number of characters is pretty low compared with something like the cohere datasets. I have seen this percentage go up to 80% too.

The user interface will either have a cluster setting that will indicate whether or not to use the derived_source feature.

Any reason why we cannot enable it by default? I think we should enable it by default. WDYT?

@jmazanec15 one more benefit of removing the vector field from source is a speedup in force merge. I was running some experiments, where I saw that if we don't have the vector in the _source, there is a clearly visible speedup in the force merge of vector indices.

// On search, my_vector1 is excluded
POST some_index/_search
{
  "_source": {
    "excludes": ["my_vector1"]
  },
  ...
}

I didn't understand why user need to do this? Because I was thinking we will just exclude the vector field while creating the index.

@jmazanec15
Member Author

I think when a model gives fp32 vectors, this percentage will be higher. With 128D, the number of characters is pretty low compared with something like the cohere datasets. I have seen this percentage go up to 80% too.

That makes sense. 80% wouldn't surprise me too much.

Any reason why we cannot enable it by default? I think we should enable it by default. WDYT?

Right - this will default to true - but there will be a setting to disable it. One reason to disable it may be for users who are pulling vectors from OpenSearch as a vector store. It may be slower in this respect.

@jmazanec15 one more benefit of removing the vector field from source is a speedup in force merge. I was running some experiments, where I saw that if we don't have the vector in the _source, there is a clearly visible speedup in the force merge of vector indices.

Oh nice - yes I think there will be a lot of kind of side effect benefits from this.

I didn't understand why user need to do this? Because I was thinking we will just exclude the vector field while creating the index.

They are not excluding the vector field when creating the index. It will actually be fully transparent. Thus, on search, if they do not exclude the field, it will be returned (like it is today). This keeps the experience consistent. If we wanted to exclude vector fields by default, this could be taken up separately.

@navneet1v
Collaborator

It will actually be fully transparent.

Can you please elaborate more on this?

They are not excluding the vector field when creating the index. It will actually be fully transparent. Thus, on search, if they do not exclude the field, it will be returned (like it is today). This keeps the experience consistent. If we wanted to exclude vector fields by default, this could be taken up separately.

Sorry I am little confused on this part. Let me try to ask the question again. Are we suggesting customer to exclude vector fields during index mapping or not?

@jmazanec15
Member Author

Sure, the experience is meant to be as transparent as possible. By this, I mean that customers will have the same user experience with derived source as they had without it. In other words, derived source will not require any kind of change in user behavior.

When they create an index like this:

// Note: knn.derived_source.enabled defaults to true; it is included here just to highlight that derived source is enabled.
PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "knn.derived_source.enabled": true
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}

they will interact in the same way with the index as if they had not enabled derived source. Thus, the following query would be expected to return the source vector:

// On search without exclusions, my_vector1 is returned in _source
POST some_index/_search
{
  ...
}

So, if they do not want the vector field, they would have to specify source exclusion, as they do with normal vector indices:

// On search, my_vector1 is excluded
POST some_index/_search
{
  "_source": {
    "excludes": ["my_vector1"]
  },
  ...
}

@navneet1v
Collaborator

@jmazanec15
Got it, makes sense. I understand why you are saying this. I remember that in the FetchSubPhase approach this experience was not consistent.

@luyuncheng
Collaborator

[Option # 1] (Preferred) Implement Custom StoredFieldsFormat in existing KNNCodec

@jmazanec15 i like this [option 1], and also when i see your POC code, i think it is pretty good.

@luyuncheng
Collaborator

luyuncheng commented Jan 21, 2025

@jmazanec15 @navneet1v is there any chance that we can implement a custom codec format which contains a custom StoredFieldsFormat AND a custom native DocValuesFormat, like #1571 and #2267? Then we could keep only one binary copy of the vector and reduce both the stored field and the Lucene binary doc values.

@jmazanec15
Copy link
Member Author

@luyuncheng Yes, I think we should do the Custom Native DocValuesFormat as well for full precision vectors. It seems like it would save some space and have some other utilities as well.
