docs: zebra::database::default module
emmyoh committed Jan 2, 2025
1 parent c06017c commit 572ddfb
Showing 2 changed files with 4 additions and 4 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/rust.yml
@@ -19,7 +19,7 @@ jobs:
- name: Checkout codebase
uses: actions/checkout@v4
- name: Generate documentation
run: time cargo doc --no-deps -Zrustdoc-map --release --quiet
run: time cargo doc --features="default_db" --no-deps -Zrustdoc-map --release --quiet
- name: Fix permissions
run: |
chmod -c -R +rX "target/doc/" | while read line; do
6 changes: 3 additions & 3 deletions README.md
@@ -23,12 +23,12 @@ Oku enables distributed storage & distribution of mutable user-generated data; t
* Capable of scaling in dataset size **without excessively impacting memory usage for an individual node**
* Capable of acceptable recall with multiple distance metrics

Despite the common need---a scalable CRUD database---existing solutions often fell short.
Despite the common need—a scalable CRUD database—existing solutions often fell short.

#### Distribution & CRUD
Many vector databases utilise the [hierarchical navigable small world (HNSW)](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) algorithm to construct their database indices, as it (a) achieves high recall on high-dimensional data regardless of distance metric, and (b) performs fast queries regardless of dataset size. However, despite its attractiveness on benchmarks, it can be impractical to use in many production contexts as (a) it's difficult to distribute as you cannot shard the index across multiple nodes, (b) the entire index must be loaded into memory to perform operations, making memory a bottleneck in addition to storage, and (c) deleting vectors essentially requires rebuilding the entire index from scratch and re-inserting every vector except the deleted ones; using redundant indices and tombstoning is the only way to keep the database online.

The need for a scalable & mutable vector database is not new, however, and the problem has apparently been solved to an acceptable degree before---content recommendation systems based on embedding vectors have been in production for many years, and they've often used some variation of [locality sensitive hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) to build a vector database index. An LSH index is not graph-based, but instead breaks up an *f*-dimensional space into regions of similar vectors. Consequently, it can be (a) sharded, (b) accessed in parallel, and (c) accessed from storage because, unlike a graph such as HNSW, it is not '[object soup](https://jacko.io/object_soup.html)' and avoids issues with cache locality and synchronisation in multithreaded contexts. LSH's advantages in performance and resource usage do come with an implication: while HNSW approximates neighbours, LSH approximates similarities. The recall of LSH is lower as it's less concerned with finding *the nearest* neighbours, and more concerned with just finding what *is near*. For fine-grained searches, LSH is less helpful, but for large & varied datasets where it's important to find records that are 'close enough', it has significant advantages.
The need for a scalable & mutable vector database is not new, however, and the problem has apparently been solved to an acceptable degree before—content recommendation systems based on embedding vectors have been in production for many years, and they've often used some variation of [locality sensitive hashing (LSH)](https://en.wikipedia.org/wiki/Locality-sensitive_hashing) to build a vector database index. An LSH index is not graph-based, but instead breaks up an *f*-dimensional space into regions of similar vectors. Consequently, it can be (a) sharded, (b) accessed in parallel, and (c) accessed from storage because, unlike a graph such as HNSW, it is not '[object soup](https://jacko.io/object_soup.html)' and avoids issues with cache locality and synchronisation in multithreaded contexts. LSH's advantages in performance and resource usage do come with an implication: while HNSW approximates neighbours, LSH approximates similarities. The recall of LSH is lower as it's less concerned with finding *the nearest* neighbours, and more concerned with just finding what *is near*. For fine-grained searches, LSH is less helpful, but for large & varied datasets where it's important to find records that are 'close enough', it has significant advantages.
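
The idea of breaking a space into regions can be illustrated with a minimal sketch of one common LSH family, sign random projection. This is not the implementation used by this project: `LshIndex`, `XorShift64`, and every method name below are hypothetical, and the hand-rolled generator only exists so the example needs no external crates. Each random hyperplane contributes one bit to a vector's signature, vectors with equal signatures share a bucket, and deleting a vector only touches its own bucket rather than the whole index.

```rust
use std::collections::HashMap;

/// Tiny deterministic generator (xorshift64), used only to keep the sketch dependency-free.
struct XorShift64(u64);

impl XorShift64 {
    fn next_f32(&mut self) -> f32 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        // Map the top 24 bits to the range [-1, 1).
        ((self.0 >> 40) as f32 / (1u64 << 23) as f32) - 1.0
    }
}

/// Hypothetical index: `planes` hyperplanes split the space into at most 2^planes regions.
struct LshIndex {
    hyperplanes: Vec<Vec<f32>>,
    buckets: HashMap<u64, Vec<usize>>, // region signature -> IDs of vectors in that region
}

impl LshIndex {
    fn new(dimensions: usize, planes: usize, seed: u64) -> Self {
        assert!(planes <= 64, "a signature is stored in a single u64");
        let mut rng = XorShift64(seed.max(1)); // xorshift must not start at zero
        let hyperplanes = (0..planes)
            .map(|_| (0..dimensions).map(|_| rng.next_f32()).collect())
            .collect();
        Self { hyperplanes, buckets: HashMap::new() }
    }

    /// One bit per hyperplane: set when the vector lies on its non-negative side.
    fn signature(&self, vector: &[f32]) -> u64 {
        self.hyperplanes.iter().enumerate().fold(0u64, |sig, (i, plane)| {
            let dot: f32 = plane.iter().zip(vector).map(|(a, b)| a * b).sum();
            if dot >= 0.0 { sig | (1u64 << i) } else { sig }
        })
    }

    fn insert(&mut self, id: usize, vector: &[f32]) {
        let sig = self.signature(vector);
        self.buckets.entry(sig).or_default().push(id);
    }

    /// Deletion only edits the bucket the vector hashed into.
    fn remove(&mut self, id: usize, vector: &[f32]) {
        let sig = self.signature(vector);
        if let Some(bucket) = self.buckets.get_mut(&sig) {
            bucket.retain(|&other| other != id);
        }
    }

    /// IDs that are 'near' the query: everything in the query's own region.
    fn candidates(&self, query: &[f32]) -> &[usize] {
        self.buckets
            .get(&self.signature(query))
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Because buckets are keyed independently, they could in principle be sharded across nodes or read from storage one at a time, which is the property the paragraph above attributes to LSH. The sketch also shows the recall trade-off: a true nearest neighbour that falls just across a hyperplane lands in a different bucket and is missed.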

#### Integrity & Safety
To avoid excessive memory usage, some have saved indexes to storage and performed operations directly on the index files as if they were in memory, taking advantage of a technique called [memory mapping (`mmap`)](https://en.wikipedia.org/wiki/Memory-mapped_file). Spotify boasts of [its LSH index](https://github.com/spotify/annoy):
@@ -41,7 +41,7 @@ Cloudflare [makes similar bold claims](https://blog.cloudflare.com/scalable-mach
>
> In the wake of our redesign, we've constructed a powerful and efficient system that truly embodies the essence of 'bliss'. Harnessing the advantages of memory-mapped files, wait-free synchronization, allocation-free operations, and zero-copy deserialization, we've established a robust infrastructure that maintains peak performance while achieving remarkable reductions in latency.
Multiprocess concurrency with `mmap` is arguably *impossible*---countless DBMSes have learned the same lesson after many years. There is [a paper on this subject](https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf) that covers the pitfalls of `mmap` in detail; suffice it to say, a memory-mapped database index has not demonstrably achieved the data integrity & memory-safety guarantees necessary for a production database.
Multiprocess concurrency with `mmap` is arguably *impossible*—countless DBMSes have learned the same lesson after many years. There is [a paper on this subject](https://db.cs.cmu.edu/papers/2022/cidr2022-p13-crotty.pdf) that covers the pitfalls of `mmap` in detail; suffice it to say, a memory-mapped database index has not demonstrably achieved the data integrity & memory-safety guarantees necessary for a production database.

#### Potential Improvements
This software is free & open-source (FOSS), and code contributions are welcome.